Commit Graph

449 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
0773736e48 Merge 5.10.104 into android12-5.10-lts
Changes in 5.10.104
	mac80211_hwsim: report NOACK frames in tx_status
	mac80211_hwsim: initialize ieee80211_tx_info at hw_scan_work
	i2c: bcm2835: Avoid clock stretching timeouts
	ASoC: rt5668: do not block workqueue if card is unbound
	ASoC: rt5682: do not block workqueue if card is unbound
	regulator: core: fix false positive in regulator_late_cleanup()
	Input: clear BTN_RIGHT/MIDDLE on buttonpads
	KVM: arm64: vgic: Read HW interrupt pending state from the HW
	tipc: fix a bit overflow in tipc_crypto_key_rcv()
	cifs: fix double free race when mount fails in cifs_get_root()
	selftests/seccomp: Fix seccomp failure by adding missing headers
	dmaengine: shdma: Fix runtime PM imbalance on error
	i2c: cadence: allow COMPILE_TEST
	i2c: qup: allow COMPILE_TEST
	net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990
	usb: gadget: don't release an existing dev->buf
	usb: gadget: clear related members when goto fail
	exfat: reuse exfat_inode_info variable instead of calling EXFAT_I()
	exfat: fix i_blocks for files truncated over 4 GiB
	tracing: Add test for user space strings when filtering on string pointers
	serial: stm32: prevent TDR register overwrite when sending x_char
	ata: pata_hpt37x: fix PCI clock detection
	drm/amdgpu: check vm ready by amdgpu_vm->evicting flag
	tracing: Add ustring operation to filtering string pointers
	ALSA: intel_hdmi: Fix reference to PCM buffer address
	riscv/efi_stub: Fix get_boot_hartid_from_fdt() return value
	riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
	riscv: Fix config KASAN && DEBUG_VIRTUAL
	ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
	iommu/amd: Recover from event log overflow
	drm/i915: s/JSP2/ICP2/ PCH
	xen/netfront: destroy queues before real_num_tx_queues is zeroed
	thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
	ntb: intel: fix port config status offset for SPR
	mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls
	xfrm: fix MTU regression
	netfilter: fix use-after-free in __nf_register_net_hook()
	bpf, sockmap: Do not ignore orig_len parameter
	xfrm: fix the if_id check in changelink
	xfrm: enforce validity of offload input flags
	e1000e: Correct NVM checksum verification flow
	net: fix up skbs delta_truesize in UDP GRO frag_list
	netfilter: nf_queue: don't assume sk is full socket
	netfilter: nf_queue: fix possible use-after-free
	netfilter: nf_queue: handle socket prefetch
	batman-adv: Request iflink once in batadv-on-batadv check
	batman-adv: Request iflink once in batadv_get_real_netdevice
	batman-adv: Don't expect inter-netns unique iflink indices
	net: ipv6: ensure we call ipv6_mc_down() at most once
	net: dcb: flush lingering app table entries for unregistered devices
	net/smc: fix connection leak
	net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client
	net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server
	rcu/nocb: Fix missed nocb_timer requeue
	ice: Fix race conditions between virtchnl handling and VF ndo ops
	ice: fix concurrent reset and removal of VFs
	sched/topology: Make sched_init_numa() use a set for the deduplicating sort
	sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
	ia64: ensure proper NUMA distance and possible map initialization
	mac80211: fix forwarded mesh frames AC & queue selection
	net: stmmac: fix return value of __setup handler
	mac80211: treat some SAE auth steps as final
	iavf: Fix missing check for running netdev
	net: sxgbe: fix return value of __setup handler
	ibmvnic: register netdev after init of adapter
	net: arcnet: com20020: Fix null-ptr-deref in com20020pci_probe()
	ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc()
	efivars: Respect "block" flag in efivar_entry_set_safe()
	firmware: arm_scmi: Remove space in MODULE_ALIAS name
	ASoC: cs4265: Fix the duplicated control name
	can: gs_usb: change active_channels's type from atomic_t to u8
	arm64: dts: rockchip: Switch RK3399-Gru DP to SPDIF output
	igc: igc_read_phy_reg_gpy: drop premature return
	ARM: Fix kgdb breakpoint for Thumb2
	ARM: 9182/1: mmu: fix returns from early_param() and __setup() functions
	selftests: mlxsw: tc_police_scale: Make test more robust
	pinctrl: sunxi: Use unique lockdep classes for IRQs
	igc: igc_write_phy_reg_gpy: drop premature return
	ibmvnic: free reset-work-item when flushing
	memfd: fix F_SEAL_WRITE after shmem huge page allocated
	s390/extable: fix exception table sorting
	ARM: dts: switch timer config to common devkit8000 devicetree
	ARM: dts: Use 32KiHz oscillator on devkit8000
	soc: fsl: guts: Revert commit 3c0d64e867
	soc: fsl: guts: Add a missing memory allocation failure check
	soc: fsl: qe: Check of ioremap return value
	ARM: tegra: Move panels to AUX bus
	ibmvnic: complete init_done on transport events
	net: chelsio: cxgb3: check the return value of pci_find_capability()
	iavf: Refactor iavf state machine tracking
	nl80211: Handle nla_memdup failures in handle_nan_filter
	drm/amdgpu: fix suspend/resume hang regression
	net: dcb: disable softirqs in dcbnl_flush_dev()
	Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
	Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
	Input: samsung-keypad - properly state IOMEM dependency
	HID: add mapping for KEY_DICTATE
	HID: add mapping for KEY_ALL_APPLICATIONS
	tracing/histogram: Fix sorting on old "cpu" value
	tracing: Fix return value of __setup handlers
	btrfs: fix lost prealloc extents beyond eof after full fsync
	btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
	btrfs: add missing run of delayed items after unlink during log replay
	Revert "xfrm: xfrm_state_mtu should return at least 1280 for ipv6"
	hamradio: fix macro redefine warning
	Linux 5.10.104

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I24dabeba483a0b0123a4e8c10d1a568b11dfb9c8
2022-03-12 13:57:09 +01:00
Frederic Weisbecker
0c145262ac rcu/nocb: Fix missed nocb_timer requeue
commit b2fcf2102049f6e56981e0ab3d9b633b8e2741da upstream.

This sequence of events can lead to a failure to requeue a CPU's
->nocb_timer:

1.	There are no callbacks queued for any CPU covered by CPU 0-2's
	->nocb_gp_kthread.  Note that ->nocb_gp_kthread is associated
	with CPU 0.

2.	CPU 1 enqueues its first callback with interrupts disabled, and
	thus must defer awakening its ->nocb_gp_kthread.  It therefore
	queues its rcu_data structure's ->nocb_timer.  At this point,
	CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.

3.	CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a
	callback, but with interrupts enabled, allowing it to directly
	awaken the ->nocb_gp_kthread.

4.	The newly awakened ->nocb_gp_kthread associates both CPU 1's
	and CPU 2's callbacks with a future grace period and arranges
	for that grace period to be started.

5.	This ->nocb_gp_kthread goes to sleep waiting for the end of this
	future grace period.

6.	This grace period elapses before the CPU 1's timer fires.
	This is normally improbably given that the timer is set for only
	one jiffy, but timers can be delayed.  Besides, it is possible
	that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.

7.	The grace period ends, so rcu_gp_kthread awakens the
	->nocb_gp_kthread, which in turn awakens both CPU 1's and
	CPU 2's ->nocb_cb_kthread.  Then ->nocb_gb_kthread sleeps
	waiting for more newly queued callbacks.

8.	CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps
	waiting for more invocable callbacks.

9.	Note that neither kthread updated any ->nocb_timer state,
	so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.

10.	CPU 1 enqueues its second callback, this time with interrupts
 	enabled so it can wake directly	->nocb_gp_kthread.
	It does so with calling wake_nocb_gp() which also cancels the
	pending timer that got queued in step 2. But that doesn't reset
	CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE.
	So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now
	desynchronized.

11.	->nocb_gp_kthread associates the callback queued in 10 with a new
	grace period, arranges for that grace period to start and sleeps
	waiting for it to complete.

12.	The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread,
	which in turn wakes up CPU 1's ->nocb_cb_kthread which then
	invokes the callback queued in 10.

13.	CPU 1 enqueues its third callback, this time with interrupts
	disabled so it must queue a timer for a deferred wakeup. However
	the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which
	incorrectly indicates that a timer is already queued.  Instead,
	CPU 1's ->nocb_timer was cancelled in 10.  CPU 1 therefore fails
	to queue the ->nocb_timer.

14.	CPU 1 has its pending callback and it may go unnoticed until
	some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever
	calls an explicit deferred wakeup, for example, during idle entry.

This commit fixes this bug by resetting rdp->nocb_defer_wakeup everytime
we delete the ->nocb_timer.

It is quite possible that there is a similar scenario involving
->nocb_bypass_timer and ->nocb_defer_wakeup.  However, despite some
effort from several people, a failure scenario has not yet been located.
However, that by no means guarantees that no such scenario exists.
Finding a failure scenario is left as an exercise for the reader, and the
"Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.

Fixes: d1b222c6be (rcu/nocb: Add bypass callback queueing)
Cc: <stable@vger.kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08 19:09:34 +01:00
Greg Kroah-Hartman
79553fad5c Merge 5.10.102 into android12-5.10-lts
Changes in 5.10.102
	drm/nouveau/pmu/gm200-: use alternate falcon reset sequence
	mm: memcg: synchronize objcg lists with a dedicated spinlock
	rcu: Do not report strict GPs for outgoing CPUs
	fget: clarify and improve __fget_files() implementation
	fs/proc: task_mmu.c: don't read mapcount for migration entry
	can: isotp: prevent race between isotp_bind() and isotp_setsockopt()
	can: isotp: add SF_BROADCAST support for functional addressing
	scsi: lpfc: Fix mailbox command failure during driver initialization
	HID:Add support for UGTABLET WP5540
	Revert "svm: Add warning message for AVIC IPI invalid target"
	serial: parisc: GSC: fix build when IOSAPIC is not set
	parisc: Drop __init from map_pages declaration
	parisc: Fix data TLB miss in sba_unmap_sg
	parisc: Fix sglist access in ccio-dma.c
	mmc: block: fix read single on recovery logic
	mm: don't try to NUMA-migrate COW pages that have other uses
	PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology
	parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
	btrfs: send: in case of IO error log it
	platform/x86: touchscreen_dmi: Add info for the RWC NANOTE P8 AY07J 2-in-1
	platform/x86: ISST: Fix possible circular locking dependency detected
	selftests: rtc: Increase test timeout so that all tests run
	kselftest: signal all child processes
	net: ieee802154: at86rf230: Stop leaking skb's
	selftests/zram: Skip max_comp_streams interface on newer kernel
	selftests/zram01.sh: Fix compression ratio calculation
	selftests/zram: Adapt the situation that /dev/zram0 is being used
	selftests: openat2: Print also errno in failure messages
	selftests: openat2: Add missing dependency in Makefile
	selftests: openat2: Skip testcases that fail with EOPNOTSUPP
	selftests: skip mincore.check_file_mmap when fs lacks needed support
	ax25: improve the incomplete fix to avoid UAF and NPD bugs
	vfs: make freeze_super abort when sync_filesystem returns error
	quota: make dquot_quota_sync return errors from ->sync_fs
	scsi: pm8001: Fix use-after-free for aborted TMF sas_task
	scsi: pm8001: Fix use-after-free for aborted SSP/STP sas_task
	nvme: fix a possible use-after-free in controller reset during load
	nvme-tcp: fix possible use-after-free in transport error_recovery work
	nvme-rdma: fix possible use-after-free in transport error_recovery work
	drm/amdgpu: fix logic inversion in check
	x86/Xen: streamline (and fix) PV CPU enumeration
	Revert "module, async: async_synchronize_full() on module init iff async is used"
	gcc-plugins/stackleak: Use noinstr in favor of notrace
	random: wake up /dev/random writers after zap
	kbuild: lto: merge module sections
	kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled
	iwlwifi: fix use-after-free
	drm/radeon: Fix backlight control on iMac 12,1
	drm/i915/opregion: check port number bounds for SWSCI display power state
	vsock: remove vsock from connected table when connect is interrupted by a signal
	drm/i915/gvt: Make DRM_I915_GVT depend on X86
	iwlwifi: pcie: fix locking when "HW not ready"
	iwlwifi: pcie: gen2: fix locking when "HW not ready"
	selftests: netfilter: fix exit value for nft_concat_range
	netfilter: nft_synproxy: unregister hooks on init error path
	ipv6: per-netns exclusive flowlabel checks
	net: dsa: lan9303: fix reset on probe
	net: dsa: lantiq_gswip: fix use after free in gswip_remove()
	net: ieee802154: ca8210: Fix lifs/sifs periods
	ping: fix the dif and sdif check in ping_lookup
	bonding: force carrier update when releasing slave
	drop_monitor: fix data-race in dropmon_net_event / trace_napi_poll_hit
	net_sched: add __rcu annotation to netdev->qdisc
	bonding: fix data-races around agg_select_timer
	libsubcmd: Fix use-after-free for realloc(..., 0)
	dpaa2-eth: Initialize mutex used in one step timestamping path
	perf bpf: Defer freeing string after possible strlen() on it
	selftests/exec: Add non-regular to TEST_GEN_PROGS
	ALSA: hda/realtek: Add quirk for Legion Y9000X 2019
	ALSA: hda/realtek: Fix deadlock by COEF mutex
	ALSA: hda: Fix regression on forced probe mask option
	ALSA: hda: Fix missing codec probe on Shenker Dock 15
	ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw()
	ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw_range()
	powerpc/lib/sstep: fix 'ptesync' build error
	mtd: rawnand: gpmi: don't leak PM reference in error path
	KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests
	ASoC: tas2770: Insert post reset delay
	block/wbt: fix negative inflight counter when remove scsi device
	NFS: LOOKUP_DIRECTORY is also ok with symlinks
	NFS: Do not report writeback errors in nfs_getattr()
	tty: n_tty: do not look ahead for EOL character past the end of the buffer
	mtd: rawnand: qcom: Fix clock sequencing in qcom_nandc_probe()
	mtd: rawnand: brcmnand: Fixed incorrect sub-page ECC status
	Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
	KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id()
	KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
	KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
	NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
	ARM: OMAP2+: hwmod: Add of_node_put() before break
	ARM: OMAP2+: adjust the location of put_device() call in omapdss_init_of
	phy: usb: Leave some clocks running during suspend
	irqchip/sifive-plic: Add missing thead,c900-plic match string
	netfilter: conntrack: don't refresh sctp entries in closed state
	arm64: dts: meson-gx: add ATF BL32 reserved-memory region
	arm64: dts: meson-g12: add ATF BL32 reserved-memory region
	arm64: dts: meson-g12: drop BL32 region from SEI510/SEI610
	pidfd: fix test failure due to stack overflow on some arches
	selftests: fixup build warnings in pidfd / clone3 tests
	kconfig: let 'shell' return enough output for deep path names
	lib/iov_iter: initialize "flags" in new pipe_buffer
	ata: libata-core: Disable TRIM on M88V29
	soc: aspeed: lpc-ctrl: Block error printing on probe defer cases
	xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
	drm/rockchip: dw_hdmi: Do not leave clock enabled in error case
	tracing: Fix tp_printk option related with tp_printk_stop_on_boot
	net: usb: qmi_wwan: Add support for Dell DW5829e
	net: macb: Align the dma and coherent dma masks
	kconfig: fix failing to generate auto.conf
	scsi: lpfc: Fix pt2pt NVMe PRLI reject LOGO loop
	EDAC: Fix calculation of returned address and next offset in edac_align_ptr()
	net: sched: limit TC_ACT_REPEAT loops
	dmaengine: sh: rcar-dmac: Check for error num after setting mask
	dmaengine: stm32-dmamux: Fix PM disable depth imbalance in stm32_dmamux_probe
	dmaengine: sh: rcar-dmac: Check for error num after dma_set_max_seg_size
	i2c: qcom-cci: don't delete an unregistered adapter
	i2c: qcom-cci: don't put a device tree node before i2c_add_adapter()
	copy_process(): Move fd_install() out of sighand->siglock critical section
	i2c: brcmstb: fix support for DSL and CM variants
	lockdep: Correct lock_classes index mapping
	Linux 5.10.102

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ief12c0d77b23f9796b81a8fc3b79ac6589e81dc9
2022-02-23 12:56:37 +01:00
Paul E. McKenney
657991fb06 rcu: Do not report strict GPs for outgoing CPUs
commit bfb3aa735f82c8d98b32a669934ee7d6b346264d upstream.

An outgoing CPU is marked offline in a stop-machine handler and most
of that CPU's services stop at that point, including IRQ work queues.
However, that CPU must take another pass through the scheduler and through
a number of CPU-hotplug notifiers, many of which contain RCU readers.
In the past, these readers were not a problem because the outgoing CPU
has interrupts disabled, so that rcu_read_unlock_special() would not
be invoked, and thus RCU would never attempt to queue IRQ work on the
outgoing CPU.

This changed with the advent of the CONFIG_RCU_STRICT_GRACE_PERIOD
Kconfig option, in which rcu_read_unlock_special() is invoked upon exit
from almost all RCU read-side critical sections.  Worse yet, because
interrupts are disabled, rcu_read_unlock_special() cannot immediately
report a quiescent state and will therefore attempt to defer this
reporting, for example, by queueing IRQ work.  Which fails with a splat
because the CPU is already marked as being offline.

But it turns out that there is no need to report this quiescent state
because rcu_report_dead() will do this job shortly after the outgoing
CPU makes its final dive into the idle loop.  This commit therefore
makes rcu_read_unlock_special() refrain from queuing IRQ work onto
outgoing CPUs.

Fixes: 44bad5b3cc ("rcu: Do full report for .need_qs for strict GPs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 12:00:56 +01:00
Paul E. McKenney
c34fa06f4b FROMLIST: rcu: Don't deboost before reporting expedited quiescent state
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx
before reporting the expedited quiescent state.  Under heavy real-time
load, this can result in this function being preempted before the
quiescent state is reported, which can in turn prevent the expedited grace
period from completing.  Tim Murray reports that the resulting expedited
grace periods can take hundreds of milliseconds and even more than one
second, when they should normally complete in less than a millisecond.

This patch follows Neeraj's suggestion (seconded by Tim and by Uladzislau
Rezki) of simply reversing the two operations.

Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Joel Fernandes <joelaf@google.com>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Bug: 217236054
Link: https://lore.kernel.org/lkml/20220204232406.814-9-paulmck@kernel.org/
Signed-off-by: Tim Murray <timmurray@google.com>
Change-Id: Ic0f0330b65a0c776a563e03e4b59bf4ab4fbccf6
2022-02-05 02:03:10 +00:00
Peter Zijlstra
af756be29c rcu: Always inline rcu_dynticks_task*_{enter,exit}()
[ Upstream commit 7663ad9a5dbcc27f3090e6bfd192c7e59222709f ]

RCU managed to grow a few noinstr violations:

  vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section
  vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section

Fix them by adding __always_inline to the relevant trivial functions.

Also replace the noinstr with __always_inline for the existing
rcu_dynticks_task_*() functions since noinstr would force noinline
them, even when empty, which seems silly.

Fixes: 7d0c9c50c5 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18 14:04:06 +01:00
Zhouyi Zhou
d3ca78775d rcu: Fix macro name CONFIG_TASKS_RCU_TRACE
[ Upstream commit fed31a4dd3adb5455df7c704de2abb639a1dc1c0 ]

This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should
instead be CONFIG_TASKS_TRACE_RCU.  Among other things, these typos
could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from
memory-ordering bugs that could result in false-positive quiescent
states and too-short grace periods.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-18 13:40:19 +02:00
Paul E. McKenney
ea5e5bc881 rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
[ Upstream commit a649d25dcc671a33b9cc3176411920fdc5fbd98e ]

This commit adds a number of lockdep_assert_irqs_disabled() calls
to rcu_sched_clock_irq() and a number of the functions that it calls.
The point of this is to help track down a situation where lockdep appears
to be insisting that interrupts are enabled within these functions, which
should only ever be invoked from the scheduling-clock interrupt handler.

Link: https://lore.kernel.org/lkml/20201111133813.GA81547@elver.google.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15 09:50:28 +02:00
Frederic Weisbecker
e713bdd791 rcu/nocb: Perform deferred wake up before last idle's need_resched() check
commit 43789ef3f7d61aa7bed0cb2764e588fc990c30ef upstream.

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this with splitting the rcuog wakeup handling from rcu_idle_enter()
and place it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.

Fixes: 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:35 +01:00
Paul E. McKenney
7fbe67e46a Merge branch 'strictgp.2020.08.24a' into HEAD
strictgp.2020.08.24a: Strict grace periods for KASAN testing.
2020-09-03 09:47:42 -07:00
Paul E. McKenney
cfeac3977a rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
The "cpu" parameter to rcu_report_qs_rdp() is not used, with rdp->cpu
being used instead.  Furtheremore, every call to rcu_report_qs_rdp()
invokes it on rdp->cpu.  This commit therefore removes this unused "cpu"
parameter and converts a check of rdp->cpu against smp_processor_id()
to a WARN_ON_ONCE().

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:28 -07:00
Paul E. McKenney
aa40c138cc rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs
The CONFIG_PREEMPT=n instance of rcu_read_unlock is even more
aggressively than that of CONFIG_PREEMPT=y in deferring reporting
quiescent states to the RCU core.  This is just what is wanted in normal
use because it reduces overhead, but the resulting delay is not what
is wanted for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
This commit therefore adds an rcu_read_unlock_strict() function that
checks for exceptional conditions, and reports the newly started
quiescent state if it is safe to do so, also doing a spin-delay if
requested via rcutree.rcu_unlock_delay.  This commit also adds a call
to rcu_read_unlock_strict() from the CONFIG_PREEMPT=n instance of
__rcu_read_unlock().

[ paulmck: Fixed bug located by kernel test robot <lkp@intel.com> ]
Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:28 -07:00
Paul E. McKenney
3d29aaf1ef rcu: Provide optional RCU-reader exit delay for strict GPs
The goal of this series is to increase the probability of tools like
KASAN detecting that an RCU-protected pointer was used outside of its
RCU read-side critical section.  Thus far, the approach has been to make
grace periods and callback processing happen faster.  Another approach
is to delay the pointer leaker.  This commit therefore allows a delay
to be applied to exit from RCU read-side critical sections.

This slowdown is specified by a new rcutree.rcu_unlock_delay kernel boot
parameter that specifies this delay in microseconds, defaulting to zero.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:27 -07:00
Paul E. McKenney
44bad5b3cc rcu: Do full report for .need_qs for strict GPs
The rcu_preempt_deferred_qs_irqrestore() function is invoked at
the end of an RCU read-side critical section (for example, directly
from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to
report the new quiescent state.  This works, except that rcu_qs() only
updates per-CPU state, leaving reporting of the actual quiescent state
to a later call to rcu_report_qs_rdp(), for example from within a later
RCU_SOFTIRQ instance.  Although this approach is exactly what you want if
you are more concerned about efficiency than about short grace periods,
in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are
the name of the game.

This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly
invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus
shortening grace periods.

Historical note:  To the best of my knowledge, causing rcu_read_unlock()
to directly report a quiescent state first appeared in Jim Houston's
and Joe Korty's JRCU.  This is the second instance of a Linux-kernel RCU
feature being inspired by JRCU, the first being RCU callback offloading
(as in the RCU_NOCB_CPU Kconfig option).

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:25 -07:00
Paul E. McKenney
f19920e412 rcu: Always set .need_qs from __rcu_read_lock() for strict GPs
The ->rcu_read_unlock_special.b.need_qs field in the task_struct
structure indicates that the RCU core needs a quiscent state from the
corresponding task.  The __rcu_read_unlock() function checks this (via
an eventual call to rcu_preempt_deferred_qs_irqrestore()), and if set
reports a quiscent state immediately upon exit from the outermost RCU
read-side critical section.

Currently, this flag is only set when the scheduling-clock interrupt
decides that the current RCU grace period is too old, as in about
one full second too old.  But if the kernel has been built with
CONFIG_RCU_STRICT_GRACE_PERIOD=y, we clearly do not want to wait that
long.  This commit therefore sets the .need_qs field immediately at the
start of the RCU read-side critical section from within __rcu_read_lock()
in order to unconditionally enlist help from __rcu_read_unlock().

But note the additional check for rcu_state.gp_kthread, which prevents
attempts to awaken RCU's grace-period kthread during early boot before
there is a scheduler.  Leaving off this check results in early boot hangs.
So early that there is no console output.  Thus, this additional check
fails until such time as RCU's grace-period kthread has been created,
avoiding these empty-console hangs.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:25 -07:00
Paul E. McKenney
8cbd0e38a9 rcu: Add Kconfig option for strict RCU grace periods
People running automated tests have asked for a way to make RCU minimize
grace-period duration in order to increase the probability of KASAN
detecting a pointer being improperly leaked from an RCU read-side critical
section, for example, like this:

	rcu_read_lock();
	p = rcu_dereference(gp);
	do_something_with(p); // OK
	rcu_read_unlock();
	do_something_else_with(p); // BUG!!!

The rcupdate.rcu_expedited boot parameter is a start in this direction,
given that it makes calls to synchronize_rcu() instead invoke the faster
(and more wasteful) synchronize_rcu_expedited().  However, this does
nothing to shorten RCU grace periods that are instead initiated by
call_rcu(), and RCU pointer-leak bugs can involve call_rcu() just as
surely as they can synchronize_rcu().

This commit therefore adds a RCU_STRICT_GRACE_PERIOD Kconfig option
that will be used to shorten normal (non-expedited) RCU grace periods.
This commit also dumps out a message when this option is in effect.
Later commits will actually shorten grace periods.

Reported-by Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:40:23 -07:00
Paul E. McKenney
4569c5ee95 rcu/nocb: Add a warning for non-GP kthread running GP code
This commit increases RCU's ability to defend itself by emitting a warning
if one of the nocb CB kthreads invokes the GP kthread's wait function.
This warning augments a similar check that is carried out at the end
of rcutorture testing and when RCU CPU stall warnings are emitted.
The problem with those checks is that the miscreants have long since
departed and disposed of any and all evidence.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:37:54 -07:00
Paul E. McKenney
2130c6b4f6 nocb: Remove show_rcu_nocb_state() false positive printout
The rcu_data structure's ->nocb_timer field is used to defer wakeups of
the corresponding no-CBs CPU's grace-period kthread ("rcuog*"), and that
structure's ->nocb_defer_wakeup field is used to track such deferral.
This means that the show_rcu_nocb_state() printing an error when those
fields are set for a CPU not corresponding to a no-CBs grace-period
kthread is erroneous.

This commit therefore switches the check from ->nocb_timer to
->nocb_bypass_timer and removes the check of ->nocb_defer_wakeup.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:36:06 -07:00
Paul E. McKenney
e082c7b381 nocb: Clarify RCU nocb CPU error message
A message of the form "rcu:    !!! lDTs ." can be tracked down, but
doing so is not trivial.  This commit therefore eases this process by
adding text so that this error message now reads as follows:
"rcu:    nocb GP activity on CB-only CPU!!! lDTs ."

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-24 18:36:05 -07:00
Paul E. McKenney
f5ca34643b rcu: No-CBs-related sleeps to idle priority
This commit converts the schedule_timeout_interruptible() call used by
RCU's no-CBs grace-period kthreads to schedule_timeout_idle().  This
conversion avoids polluting the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
a9352f72d6 rcu: Priority-boost-related sleeps to idle priority
This commit converts the long-standing schedule_timeout_interruptible()
call used by RCU's priority-boosting kthreads to schedule_timeout_idle().
This conversion avoids polluting the load-average with RCU-related
sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Thomas Gleixner
ff5c4f5cad rcu/tree: Mark the idle relevant functions noinstr
These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.

Mark the places which are safe to invoke traceable functions with
instrumentation_begin/end() so objtool won't complain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
2020-05-19 15:51:20 +02:00
Paul E. McKenney
f736e0f1a5 Merge branches 'fixes.2020.04.27a', 'kfree_rcu.2020.04.27a', 'rcu-tasks.2020.04.27a', 'stall.2020.04.27a' and 'torture.2020.05.07a' into HEAD
fixes.2020.04.27a:  Miscellaneous fixes.
kfree_rcu.2020.04.27a:  Changes related to kfree_rcu().
rcu-tasks.2020.04.27a:  Addition of new RCU-tasks flavors.
stall.2020.04.27a:  RCU CPU stall-warning updates.
torture.2020.05.07a:  Torture-test updates.
2020-05-07 10:18:32 -07:00
Paul E. McKenney
7d0c9c50c5 rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks.  Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.

Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU.  This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task.  And that task could switch from one CPU to
another at any time.

This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn.  If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.

With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
43766c3ead rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks
This commit makes the calls to rcu_tasks_qs() detect and report
quiescent states for RCU tasks trace.  If the task is in a quiescent
state and if ->trc_reader_checked is not yet set, the task sets its own
->trc_reader_checked.  This will cause the grace-period kthread to
remove it from the holdout list if it still remains there.

[ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
66777e5821 rcu-tasks: Use context-switch hook for PREEMPT=y kernels
Currently, the PREEMPT=y version of rcu_note_context_switch() does not
invoke rcu_tasks_qs(), and we need it to in order to keep RCU Tasks
Trace's IPIs down to a dull roar.  This commit therefore enables this
hook.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
5f5fa7ea89 rcu: Don't use negative nesting depth in __rcu_read_unlock()
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section.  Consider the old vulnerability
involving rcu_read_unlock() being invoked within such a handler that
interrupted an __rcu_read_unlock_special(), in which a wakeup might be
invoked with a scheduler lock held.  Because rcu_read_unlock_special()
no longer does wakeups in such situations, it is no longer necessary
for __rcu_read_unlock() to set the nesting level negative.

This commit therefore removes this recursion-protection code from
__rcu_read_unlock().

[ paulmck: Let rcu_exp_handler() continue to call rcu_report_exp_rdp(). ]
[ paulmck: Adjust other checks given no more negative nesting. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
f0bdf6d473 rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field
The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
rcu_read_unlock_special() but never set to false.  This is not
particularly useful, so this commit removes this field.

The only possible justification for this field is to ease debugging
of RCU deferred quiscent states, but the combination of the other
->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
->rcu_read_lock_nesting should cover debugging needs.  And if this last
proves incorrect, this patch can always be reverted, along with the
required setting of ->rcu_read_unlock_special.b.deferred_qs to false
in rcu_preempt_deferred_qs_irqrestore().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
07b4a930fc rcu: Don't set nesting depth negative in rcu_preempt_deferred_qs()
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section.  Consider the old vulnerability
involving rcu_preempt_deferred_qs() being invoked within such a handler
that interrupted an extended RCU read-side critical section, in which
a wakeup might be invoked with a scheduler lock held.  Because
rcu_read_unlock_special() no longer does wakeups in such situations,
it is no longer necessary for rcu_preempt_deferred_qs() to set the
nesting level negative.

This commit therefore removes this recursion-protection code from
rcu_preempt_deferred_qs().

[ paulmck: Fix typo in commit log per Steve Rostedt. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
e4453d8a1c rcu: Make rcu_read_unlock_special() safe for rq/pi locks
The scheduler is currently required to hold rq/pi locks across the entire
RCU read-side critical section or not at all.  This is inconvenient and
leaves traps for the unwary, including the author of this commit.

But now that excessively long grace periods enable scheduling-clock
interrupts for holdout nohz_full CPUs, the nohz_full rescue logic in
rcu_read_unlock_special() can be dispensed with.  In other words, the
rcu_read_unlock_special() function can refrain from doing wakeups unless
such wakeups are guaranteed safe.

This commit therefore avoids unsafe wakeups, freeing the scheduler to
hold rq/pi locks across rcu_read_unlock() even if the corresponding RCU
read-side critical section might have been preempted.  This commit also
updates RCU's requirements documentation.

This commit is inspired by a patch from Lai Jiangshan:
https://lore.kernel.org/lkml/20191102124559.1135-2-laijs@linux.alibaba.com
This commit is further intended to be a step towards his goal of permitting
the inlining of RCU-preempt's rcu_read_lock() and rcu_read_unlock().

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
e2f3ccfa62 rcu: Convert rcu_nohz_full_cpu() ULONG_CMP_LT() to time_before()
This commit converts the ULONG_CMP_LT() in rcu_nohz_full_cpu() to
time_before() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:17 -07:00
Paul E. McKenney
7b2413111a rcu: Convert rcu_initiate_boost() ULONG_CMP_GE() to time_after()
This commit converts the ULONG_CMP_GE() in rcu_initiate_boost() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
5822b8126f rcu: Add WRITE_ONCE() to rcu_node ->boost_tasks
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
065a6db12a rcu: Add READ_ONCE and data_race() to rcu_node ->boost_tasks
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the READ_ONCE() to one load in order to avoid destructive
compiler optimizations.  The other load is from a diagnostic print,
so data_race() suffices.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
314eeb43e5 rcu: Add *_ONCE() and data_race() to rcu_node ->exp_tasks plus locking
There are lockless loads from the rcu_node structure's ->exp_tasks field,
so this commit causes all stores to use WRITE_ONCE() and all lockless
loads to use READ_ONCE() or data_race(), with the latter for debug
prints.  This code also did a unprotected traversal of the linked list
pointed into by ->exp_tasks, so this commit also acquires the rcu_node
structure's ->lock to properly protect this traversal.  This list was
traversed unprotected only when printing an RCU CPU stall warning for
an expedited grace period, so the odds of seeing this in production are
not all that high.

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:15 -07:00
Paul E. McKenney
aa93ec620b Merge branches 'doc.2020.02.27a', 'fixes.2020.03.21a', 'kfree_rcu.2020.02.20a', 'locktorture.2020.02.20a', 'ovld.2020.02.20a', 'rcu-tasks.2020.02.20a', 'srcu.2020.02.20a' and 'torture.2020.02.20a' into HEAD
doc.2020.02.27a: Documentation updates.
fixes.2020.03.21a: Miscellaneous fixes.
kfree_rcu.2020.02.20a: Updates to kfree_rcu().
locktorture.2020.02.20a: Lock torture-test updates.
ovld.2020.02.20a: Updates to callback-overload handling.
rcu-tasks.2020.02.20a: RCU-tasks updates.
srcu.2020.02.20a: SRCU updates.
torture.2020.02.20a: Torture-test updates.
2020-03-21 17:15:11 -07:00
Colin Ian King
aa96a93ba2 rcu: Fix spelling mistake "leval" -> "level"
This commit fixes a spelling mistake in a pr_info() message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
8c14263d35 rcu: React to callback overload by boosting RCU readers
RCU priority boosting currently is not applied until the grace period
is at least 250 milliseconds old (or the number of milliseconds specified
by the CONFIG_RCU_BOOST_DELAY Kconfig option).  Although this has worked
well, it can result in OOM under conditions of RCU callback flooding.
One can argue that the real-time systems using RCU priority boosting
should carefully avoid RCU callback flooding, but one can just as well
argue that an OOM is a rather obnoxious error message.

This commit therefore disables the RCU priority boosting delay when
there are excessive numbers of callbacks queued.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
b2b00ddf19 rcu: React to callback overload by aggressively seeking quiescent states
In default configutions, RCU currently waits at least 100 milliseconds
before asking cond_resched() and/or resched_rcu() for help seeking
quiescent states to end a grace period.  But 100 milliseconds can be
one good long time during an RCU callback flood, for example, as can
happen when user processes repeatedly open and close files in a tight
loop.  These 100-millisecond gaps in successive grace periods during a
callback flood can result in excessive numbers of callbacks piling up,
unnecessarily increasing memory footprint.

This commit therefore asks cond_resched() and/or resched_rcu() for help
as early as the first FQS scan when at least one of the CPUs has more
than 20,000 callbacks queued, a number that can be changed using the new
rcutree.qovld kernel boot parameter.  An auxiliary qovld_calc variable
is used to avoid acquisition of locks that have not yet been initialized.
Early tests indicate that this reduces the RCU-callback memory footprint
during rcutorture floods by from 50% to 4x, depending on configuration.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Tejun Heo <tj@kernel.org>
[ paulmck: Fix bug located by Qian Cai. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Qian Cai <cai@lca.pw>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
3d05031ae6 rcu: Make nocb_gp_wait() double-check unexpected-callback warning
Currently, nocb_gp_wait() unconditionally complains if there is a
callback not already associated with a grace period.  This assumes that
either there was no such callback initially on the one hand, or that
the rcu_advance_cbs() function assigned all such callbacks to a grace
period on the other.  However, in theory there are some situations that
would prevent rcu_advance_cbs() from assigning all of the callbacks.

This commit therefore checks for unassociated callbacks immediately after
rcu_advance_cbs() returns, while the corresponding rcu_node structure's
->lock is still held.  If there are unassociated callbacks at that point,
the subsequent WARN_ON_ONCE() is disabled.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
13817dd589 rcu: Tighten rcu_lockdep_assert_cblist_protected() check
The ->nocb_lock lockdep assertion is currently guarded by cpu_online(),
which is incorrect for no-CBs CPUs, whose callback lists must be
protected by ->nocb_lock regardless of whether or not the corresponding
CPU is online.  This situation could result in failure to detect bugs
resulting from failing to hold ->nocb_lock for offline CPUs.

This commit therefore removes the cpu_online() guard.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Jules Irenge
92c0b889f2 rcu/nocb: Add missing annotation for rcu_nocb_bypass_unlock()
Sparse reports warning at rcu_nocb_bypass_unlock()

warning: context imbalance in rcu_nocb_bypass_unlock() - unexpected unlock

The root cause is a missing annotation of rcu_nocb_bypass_unlock()
which causes the warning.

This commit therefore adds the missing __releases(&rdp->nocb_bypass_lock)
annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
2020-02-20 15:58:23 -08:00
Jules Irenge
9ced454807 rcu: Add missing annotation for rcu_nocb_bypass_lock()
Sparse reports warning at rcu_nocb_bypass_lock()

|warning: context imbalance in rcu_nocb_bypass_lock() - wrong count at exit

To fix this, this commit adds an __acquires(&rdp->nocb_bypass_lock).
Given that rcu_nocb_bypass_lock() does actually call raw_spin_lock()
when raw_spin_trylock() fails, this not only fixes the warning but also
improves on the readability of the code.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
3ca3b0e2cb rcu: Add *_ONCE() to rcu_node ->boost_kthread_status
The rcu_node structure's ->boost_kthread_status field is accessed
locklessly, so this commit causes all updates to use WRITE_ONCE() and
all reads to use READ_ONCE().

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
8ff37290d6 rcu: Add *_ONCE() for grace-period progress indicators
The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity,
and ->gp_activity fields are read locklessly, so they must be updated with
WRITE_ONCE() and, when read locklessly, with READ_ONCE().  This commit makes
these changes.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
0e247386d9 Merge branches 'doc.2019.12.10a', 'exp.2019.12.09a', 'fixes.2020.01.24a', 'kfree_rcu.2020.01.24a', 'list.2020.01.10a', 'preempt.2020.01.24a' and 'torture.2019.12.09a' into HEAD
doc.2019.12.10a: Documentations updates
exp.2019.12.09a: Expedited grace-period updates
fixes.2020.01.24a: Miscellaneous fixes
kfree_rcu.2020.01.24a: Batch kfree_rcu() work
list.2020.01.10a: RCU-protected-list updates
preempt.2020.01.24a: Preemptible RCU updates
torture.2019.12.09a: Torture-test updates
2020-01-24 10:37:27 -08:00
Lai Jiangshan
77339e61aa rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
This commit provides wrapper functions for uses of ->rcu_read_lock_nesting
to improve readability and to ease future changes to support inlining
of __rcu_read_lock() and __rcu_read_unlock().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Paul E. McKenney
c51f83c315 rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
The rcu_node structure's ->expmask field is updated only when holding the
->lock, but is also accessed locklessly.  This means that all ->expmask
updates must use WRITE_ONCE() and all reads carried out without holding
->lock must use READ_ONCE().  This commit therefore changes the lockless
->expmask read in rcu_read_unlock_special() to use READ_ONCE().

Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
3717e1e9f2 rcu: Clear ->rcu_read_unlock_special only once
In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is
cleared one piece at a time.  Given that the "if" statements in this
function use the copy in "special", this commit removes the clearing
of the individual pieces in favor of clearing ->rcu_read_unlock_special
in one go just after it has been determined to be non-zero.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
2eeba5838f rcu: Clear .exp_hint only when deferred quiescent state has been reported
Currently, the .exp_hint flag is cleared in rcu_read_unlock_special(),
which works, but which can also prevent subsequent rcu_read_unlock() calls
from helping expedite the quiescent state needed by an ongoing expedited
RCU grace period.  This commit therefore defers clearing of .exp_hint
from rcu_read_unlock_special() to rcu_preempt_deferred_qs_irqrestore(),
thus ensuring that intervening calls to rcu_read_unlock() have a chance
to help end the expedited grace period.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00