Patch series "mm/damon: Fix fake /proc/loadavg reports", v3.
This patchset fixes DAMON's fake load report issue. The first patch
makes yet another variant of usleep_range() for this fix, and the second
patch fixes the issue of DAMON by making it using the newly introduced
function.
This patch (of 2):
Some kernel threads such as DAMON could need to repeatedly sleep in
micro seconds level. Because usleep_range() sleeps in uninterruptible
state, however, such threads would make /proc/loadavg reports fake load.
To help such cases, this commit implements a variant of usleep_range()
called usleep_idle_range(). It is same to usleep_range() but sets the
state of the current task as TASK_IDLE while sleeping.
Link: https://lkml.kernel.org/r/20211126145015.15862-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211126145015.15862-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e4779015fd5d2fb8390c258268addff24d6077c7)
Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Ie590ba5fcff22c981d0a7ecae6d8e551160136f3
Changes in 5.10.58
Revert "ACPICA: Fix memory leak caused by _CID repair function"
ALSA: seq: Fix racy deletion of subscriber
bus: ti-sysc: Fix gpt12 system timer issue with reserved status
net: xfrm: fix memory leak in xfrm_user_rcv_msg
arm64: dts: ls1028a: fix node name for the sysclk
ARM: imx: add missing iounmap()
ARM: imx: add missing clk_disable_unprepare()
ARM: dts: imx6qdl-sr-som: Increase the PHY reset duration to 10ms
arm64: dts: ls1028: sl28: fix networking for variant 2
ARM: dts: colibri-imx6ull: limit SDIO clock to 25MHz
ARM: imx: fix missing 3rd argument in macro imx_mmdc_perf_init
ARM: dts: imx: Swap M53Menlo pinctrl_power_button/pinctrl_power_out pins
arm64: dts: armada-3720-turris-mox: fixed indices for the SDHC controllers
arm64: dts: armada-3720-turris-mox: remove mrvl,i2c-fast-mode
ALSA: usb-audio: fix incorrect clock source setting
clk: stm32f4: fix post divisor setup for I2S/SAI PLLs
ARM: dts: am437x-l4: fix typo in can@0 node
omap5-board-common: remove not physically existing vdds_1v8_main fixed-regulator
dmaengine: uniphier-xdmac: Use readl_poll_timeout_atomic() in atomic state
clk: tegra: Implement disable_unused() of tegra_clk_sdmmc_mux_ops
dmaengine: stm32-dma: Fix PM usage counter imbalance in stm32 dma ops
dmaengine: stm32-dmamux: Fix PM usage counter unbalance in stm32 dmamux ops
spi: imx: mx51-ecspi: Reinstate low-speed CONFIGREG delay
spi: imx: mx51-ecspi: Fix low-speed CONFIGREG delay calculation
scsi: sr: Return correct event when media event code is 3
media: videobuf2-core: dequeue if start_streaming fails
ARM: dts: stm32: Disable LAN8710 EDPD on DHCOM
ARM: dts: stm32: Fix touchscreen IRQ line assignment on DHCOM
dmaengine: imx-dma: configure the generic DMA type to make it work
net, gro: Set inner transport header offset in tcp/udp GRO hook
net: dsa: sja1105: overwrite dynamic FDB entries with static ones in .port_fdb_add
net: dsa: sja1105: invalidate dynamic FDB entries learned concurrently with statically added ones
net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too
net: dsa: sja1105: match FDB entries regardless of inner/outer VLAN tag
net: phy: micrel: Fix detection of ksz87xx switch
net: natsemi: Fix missing pci_disable_device() in probe and remove
gpio: tqmx86: really make IRQ optional
RDMA/mlx5: Delay emptying a cache entry when a new MR is added to it recently
sctp: move the active_key update after sh_keys is added
nfp: update ethtool reporting of pauseframe control
net: ipv6: fix returned variable type in ip6_skb_dst_mtu
net: dsa: qca: ar9331: reorder MDIO write sequence
net: sched: fix lockdep_set_class() typo error for sch->seqlock
MIPS: check return value of pgtable_pmd_page_ctor
mips: Fix non-POSIX regexp
bnx2x: fix an error code in bnx2x_nic_load()
net: pegasus: fix uninit-value in get_interrupt_interval
net: fec: fix use-after-free in fec_drv_remove
net: vxge: fix use-after-free in vxge_device_unregister
blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()
Bluetooth: defer cleanup of resources in hci_unregister_dev()
USB: usbtmc: Fix RCU stall warning
USB: serial: option: add Telit FD980 composition 0x1056
USB: serial: ch341: fix character loss at high transfer rates
USB: serial: ftdi_sio: add device ID for Auto-M3 OP-COM v2
firmware_loader: use -ETIMEDOUT instead of -EAGAIN in fw_load_sysfs_fallback
firmware_loader: fix use-after-free in firmware_fallback_sysfs
drm/amdgpu/display: fix DMUB firmware version info
ALSA: pcm - fix mmap capability check for the snd-dummy driver
ALSA: hda/realtek: add mic quirk for Acer SF314-42
ALSA: hda/realtek: Fix headset mic for Acer SWIFT SF314-56 (ALC256)
ALSA: usb-audio: Fix superfluous autosuspend recovery
ALSA: usb-audio: Add registration quirk for JBL Quantum 600
usb: dwc3: gadget: Avoid runtime resume if disabling pullup
usb: gadget: remove leaked entry from udc driver list
usb: cdns3: Fixed incorrect gadget state
usb: gadget: f_hid: added GET_IDLE and SET_IDLE handlers
usb: gadget: f_hid: fixed NULL pointer dereference
usb: gadget: f_hid: idle uses the highest byte for duration
usb: host: ohci-at91: suspend/resume ports after/before OHCI accesses
usb: typec: tcpm: Keep other events when receiving FRS and Sourcing_vbus events
usb: otg-fsm: Fix hrtimer list corruption
clk: fix leak on devm_clk_bulk_get_all() unwind
scripts/tracing: fix the bug that can't parse raw_trace_func
tracing / histogram: Give calculation hist_fields a size
tracing: Reject string operand in the histogram expression
tracing: Fix NULL pointer dereference in start_creating
tracepoint: static call: Compare data on transition from 2->1 callees
tracepoint: Fix static call function vs data state mismatch
arm64: stacktrace: avoid tracing arch_stack_walk()
optee: Clear stale cache entries during initialization
tee: add tee_shm_alloc_kernel_buf()
optee: Fix memory leak when failing to register shm pages
optee: Refuse to load the driver under the kdump kernel
optee: fix tee out of memory failure seen during kexec reboot
tpm_ftpm_tee: Free and unregister TEE shared memory during kexec
staging: rtl8723bs: Fix a resource leak in sd_int_dpc
staging: rtl8712: get rid of flush_scheduled_work
staging: rtl8712: error handling refactoring
drivers core: Fix oops when driver probe fails
media: rtl28xxu: fix zero-length control request
pipe: increase minimum default pipe size to 2 pages
ext4: fix potential htree corruption when growing large_dir directories
serial: tegra: Only print FIFO error message when an error occurs
serial: 8250_mtk: fix uart corruption issue when rx power off
serial: 8250: Mask out floating 16/32-bit bus bits
MIPS: Malta: Do not byte-swap accesses to the CBUS UART
serial: 8250_pci: Enumerate Elkhart Lake UARTs via dedicated driver
serial: 8250_pci: Avoid irq sharing for MSI(-X) interrupts.
fpga: dfl: fme: Fix cpu hotplug issue in performance reporting
timers: Move clearing of base::timer_running under base:: Lock
xfrm: Fix RCU vs hash_resize_mutex lock inversion
net/xfrm/compat: Copy xfrm_spdattr_type_t atributes
pcmcia: i82092: fix a null pointer dereference bug
selinux: correct the return value when loads initial sids
bus: ti-sysc: AM3: RNG is GP only
Revert "gpio: mpc8xxx: change the gpio interrupt flags."
ARM: omap2+: hwmod: fix potential NULL pointer access
md/raid10: properly indicate failure when ending a failed write request
KVM: x86: accept userspace interrupt only if no event is injected
KVM: Do not leak memory for duplicate debugfs directories
KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds
arm64: vdso: Avoid ISB after reading from cntvct_el0
soc: ixp4xx: fix printing resources
interconnect: Fix undersized devress_alloc allocation
spi: meson-spicc: fix memory leak in meson_spicc_remove
interconnect: Zero initial BW after sync-state
interconnect: Always call pre_aggregate before aggregate
interconnect: qcom: icc-rpmh: Ensure floor BW is enforced for all nodes
drm/i915: Correct SFC_DONE register offset
soc: ixp4xx/qmgr: fix invalid __iomem access
perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
sched/rt: Fix double enqueue caused by rt_effective_prio
drm/i915: avoid uninitialised var in eb_parse()
libata: fix ata_pio_sector for CONFIG_HIGHMEM
reiserfs: add check for root_inode in reiserfs_fill_super
reiserfs: check directory items on read from disk
virt_wifi: fix error on connect
net: qede: Fix end of loop tests for list_for_each_entry
alpha: Send stop IPI to send to online CPUs
net/qla3xxx: fix schedule while atomic in ql_wait_for_drvr_lock and ql_adapter_reset
smb3: rc uninitialized in one fallocate path
drm/amdgpu/display: only enable aux backlight control for OLED panels
arm64: fix compat syscall return truncation
Linux 5.10.58
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2533667974c9dff419a14d63e0e8febfb3de80f1
commit bb7262b295472eb6858b5c49893954794027cd84 upstream.
syzbot reported KCSAN data races vs. timer_base::timer_running being set to
NULL without holding base::lock in expire_timers().
This looks innocent and most reads are clearly not problematic, but
Frederic identified an issue which is:
int data = 0;
void timer_func(struct timer_list *t)
{
data = 1;
}
CPU 0 CPU 1
------------------------------ --------------------------
base = lock_timer_base(timer, &flags); raw_spin_unlock(&base->lock);
if (base->running_timer != timer) call_timer_fn(timer, fn, baseclk);
ret = detach_if_pending(timer, base, true); base->running_timer = NULL;
raw_spin_unlock_irqrestore(&base->lock, flags); raw_spin_lock(&base->lock);
x = data;
If the timer has previously executed on CPU 1 and then CPU 0 can observe
base->running_timer == NULL and returns, assuming the timer has completed,
but it's not guaranteed on all architectures. The comment for
del_timer_sync() makes that guarantee. Moving the assignment under
base->lock prevents this.
For non-RT kernel it's performance wise completely irrelevant whether the
store happens before or after taking the lock. For an RT kernel moving the
store under the lock requires an extra unlock/lock pair in the case that
there is a waiter for the timer, but that's not the end of the world.
Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com
Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com
Fixes: 030dcdd197 ("timers: Prepare support for PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.54
igc: Fix use-after-free error during reset
igb: Fix use-after-free error during reset
igc: change default return of igc_read_phy_reg()
ixgbe: Fix an error handling path in 'ixgbe_probe()'
igc: Fix an error handling path in 'igc_probe()'
igb: Fix an error handling path in 'igb_probe()'
fm10k: Fix an error handling path in 'fm10k_probe()'
e1000e: Fix an error handling path in 'e1000_probe()'
iavf: Fix an error handling path in 'iavf_probe()'
igb: Check if num of q_vectors is smaller than max before array access
igb: Fix position of assignment to *ring
gve: Fix an error handling path in 'gve_probe()'
net: add kcov handle to skb extensions
bonding: fix suspicious RCU usage in bond_ipsec_add_sa()
bonding: fix null dereference in bond_ipsec_add_sa()
ixgbevf: use xso.real_dev instead of xso.dev in callback functions of struct xfrmdev_ops
bonding: fix suspicious RCU usage in bond_ipsec_del_sa()
bonding: disallow setting nested bonding + ipsec offload
bonding: Add struct bond_ipesc to manage SA
bonding: fix suspicious RCU usage in bond_ipsec_offload_ok()
bonding: fix incorrect return value of bond_ipsec_offload_ok()
ipv6: fix 'disable_policy' for fwd packets
stmmac: platform: Fix signedness bug in stmmac_probe_config_dt()
selftests: icmp_redirect: remove from checking for IPv6 route get
selftests: icmp_redirect: IPv6 PMTU info should be cleared after redirect
pwm: sprd: Ensure configuring period and duty_cycle isn't wrongly skipped
cxgb4: fix IRQ free race during driver unload
mptcp: fix warning in __skb_flow_dissect() when do syn cookie for subflow join
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
perf inject: Fix dso->nsinfo refcounting
perf map: Fix dso->nsinfo refcounting
perf probe: Fix dso->nsinfo refcounting
perf env: Fix sibling_dies memory leak
perf test session_topology: Delete session->evlist
perf test event_update: Fix memory leak of evlist
perf dso: Fix memory leak in dso__new_map()
perf test maps__merge_in: Fix memory leak of maps
perf env: Fix memory leak of cpu_pmu_caps
perf report: Free generated help strings for sort option
perf script: Fix memory 'threads' and 'cpus' leaks on exit
perf lzma: Close lzma stream on exit
perf probe-file: Delete namelist in del_events() on the error path
perf data: Close all files in close_dir()
perf sched: Fix record failure when CONFIG_SCHEDSTATS is not set
ASoC: wm_adsp: Correct wm_coeff_tlv_get handling
spi: imx: add a check for speed_hz before calculating the clock
spi: stm32: fixes pm_runtime calls in probe/remove
regulator: hi6421: Use correct variable type for regmap api val argument
regulator: hi6421: Fix getting wrong drvdata
spi: mediatek: fix fifo rx mode
ASoC: rt5631: Fix regcache sync errors on resume
bpf, test: fix NULL pointer dereference on invalid expected_attach_type
bpf: Fix tail_call_reachable rejection for interpreter when jit failed
xdp, net: Fix use-after-free in bpf_xdp_link_release
timers: Fix get_next_timer_interrupt() with no timers pending
liquidio: Fix unintentional sign extension issue on left shift of u16
s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1]
bpf, sockmap: Fix potential memory leak on unlikely error case
bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats
bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats
bpftool: Check malloc return value in mount_bpffs_for_pin
net: fix uninit-value in caif_seqpkt_sendmsg
usb: hso: fix error handling code of hso_create_net_device
dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
efi/tpm: Differentiate missing and invalid final event log table.
net: decnet: Fix sleeping inside in af_decnet
KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash
KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak
net: sched: fix memory leak in tcindex_partial_destroy_work
sctp: trim optlen when it's a huge value in sctp_setsockopt
netrom: Decrease sock refcount when sock timers expire
scsi: iscsi: Fix iface sysfs attr detection
scsi: target: Fix protect handling in WRITE SAME(32)
spi: cadence: Correct initialisation of runtime PM again
ACPI: Kconfig: Fix table override from built-in initrd
bnxt_en: don't disable an already disabled PCI device
bnxt_en: Refresh RoCE capabilities in bnxt_ulp_probe()
bnxt_en: Add missing check for BNXT_STATE_ABORT_ERR in bnxt_fw_rset_task()
bnxt_en: Validate vlan protocol ID on RX packets
bnxt_en: Check abort error state in bnxt_half_open_nic()
net: hisilicon: rename CACHE_LINE_MASK to avoid redefinition
net/tcp_fastopen: fix data races around tfo_active_disable_stamp
ALSA: hda: intel-dsp-cfg: add missing ElkhartLake PCI ID
net: hns3: fix possible mismatches resp of mailbox
net: hns3: fix rx VLAN offload state inconsistent issue
spi: spi-bcm2835: Fix deadlock
net/sched: act_skbmod: Skip non-Ethernet packets
ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
ceph: don't WARN if we're still opening a session to an MDS
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem"
afs: Fix tracepoint string placement with built-in AFS
r8169: Avoid duplicate sysfs entry creation error
nvme: set the PRACT bit when using Write Zeroes with T10 PI
sctp: update active_key for asoc when old key is being replaced
tcp: disable TFO blackhole logic by default
net: dsa: sja1105: make VID 4095 a bridge VLAN too
net: sched: cls_api: Fix the the wrong parameter
drm/panel: raspberrypi-touchscreen: Prevent double-free
cifs: only write 64kb at a time when fallocating a small region of a file
cifs: fix fallocate when trying to allocate a hole.
proc: Avoid mixing integer types in mem_rw()
mmc: core: Don't allocate IDA for OF aliases
s390/ftrace: fix ftrace_update_ftrace_func implementation
s390/boot: fix use of expolines in the DMA code
ALSA: usb-audio: Add missing proc text entry for BESPOKEN type
ALSA: usb-audio: Add registration quirk for JBL Quantum headsets
ALSA: sb: Fix potential ABBA deadlock in CSP driver
ALSA: hda/realtek: Fix pop noise and 2 Front Mic issues on a machine
ALSA: hdmi: Expose all pins on MSI MS-7C94 board
ALSA: pcm: Call substream ack() method upon compat mmap commit
ALSA: pcm: Fix mmap capability check
Revert "usb: renesas-xhci: Fix handling of unknown ROM state"
usb: xhci: avoid renesas_usb_fw.mem when it's unusable
xhci: Fix lost USB 2 remote wake
KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
usb: hub: Disable USB 3 device initiated lpm if exit latency is too high
usb: hub: Fix link power management max exit latency (MEL) calculations
USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS
usb: max-3421: Prevent corruption of freed memory
usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop()
USB: serial: option: add support for u-blox LARA-R6 family
USB: serial: cp210x: fix comments for GE CS1000
USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe
usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode.
usb: dwc2: gadget: Fix sending zero length packet in DDMA mode.
usb: typec: stusb160x: register role switch before interrupt registration
firmware/efi: Tell memblock about EFI iomem reservations
tracepoints: Update static_call before tp_funcs when adding a tracepoint
tracing/histogram: Rename "cpu" to "common_cpu"
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
tracing: Synthetic event field_pos is an index not a boolean
btrfs: check for missing device in btrfs_trim_fs
media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf()
ixgbe: Fix packet corruption due to missing DMA sync
bus: mhi: core: Validate channel ID when processing command completions
posix-cpu-timers: Fix rearm racing against process tick
selftest: use mmap instead of posix_memalign to allocate memory
io_uring: explicitly count entries for poll reqs
io_uring: remove double poll entry on arm failure
userfaultfd: do not untag user pointers
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
hugetlbfs: fix mount mode command line processing
rbd: don't hold lock_rwsem while running_list is being drained
rbd: always kick acquire on "acquired" and "released" notifications
misc: eeprom: at24: Always append device id even if label property is set.
nds32: fix up stack guard gap
driver core: Prevent warning when removing a device link from unregistered consumer
drm: Return -ENOTTY for non-drm ioctls
drm/amdgpu: update golden setting for sienna_cichlid
net: dsa: mv88e6xxx: enable SerDes RX stats for Topaz
net: dsa: mv88e6xxx: enable SerDes PCS register dump via ethtool -d on Topaz
PCI: Mark AMD Navi14 GPU ATS as broken
bonding: fix build issue
skbuff: Release nfct refcount on napi stolen or re-used skbs
Documentation: Fix intiramfs script name
perf inject: Close inject.output on exit
usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI
drm/i915/gvt: Clear d3_entered on elsp cmd submission.
sfc: ensure correct number of XDP queues
xhci: add xhci_get_virt_ep() helper
skbuff: Fix build with SKB extensions disabled
Linux 5.10.54
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifd2823b47ab1544cd1f168b138624ffe060a471e
[ Upstream commit aebacb7f6ca1926918734faae14d1f0b6fae5cb7 ]
31cd0e119d ("timers: Recalculate next timer interrupt only when
necessary") subtly altered get_next_timer_interrupt()'s behaviour. The
function no longer consistently returns KTIME_MAX with no timers
pending.
In order to decide if there are any timers pending we check whether the
next expiry will happen NEXT_TIMER_MAX_DELTA jiffies from now.
Unfortunately, the next expiry time and the timer base clock are no
longer updated in unison. The former changes upon certain timer
operations (enqueue, expire, detach), whereas the latter keeps track of
jiffies as they move forward. Ultimately breaking the logic above.
A simplified example:
- Upon entering get_next_timer_interrupt() with:
jiffies = 1
base->clk = 0;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry == base->clk + NEXT_TIMER_MAX_DELTA', the function
returns KTIME_MAX.
- 'base->clk' is updated to the jiffies value.
- The next time we enter get_next_timer_interrupt(), taking into account
no timer operations happened:
base->clk = 1;
base->next_expiry = NEXT_TIMER_MAX_DELTA;
'base->next_expiry != base->clk + NEXT_TIMER_MAX_DELTA', the function
returns a valid expire time, which is incorrect.
This ultimately might unnecessarily rearm sched's timer on nohz_full
setups, and add latency to the system[1].
So, introduce 'base->timers_pending'[2], update it every time
'base->next_expiry' changes, and use it in get_next_timer_interrupt().
[1] See tick_nohz_stop_tick().
[2] A quick pahole check on x86_64 and arm64 shows it doesn't make
'struct timer_base' any bigger.
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Move the calc_index vendor hook one line ahead to cover more
use cases.
Bug: 181296757
Signed-off-by: Huang Yiwei <hyiwei@codeaurora.org>
Change-Id: I52231a3ccbe622021232c7a54354c5ac02cf3952
Since we're expecting timers more precisely in short period, add
a vendor hook to calc_index when adding timers. Then we can modify
the index this timer used to make it accurate.
Bug: 178758017
Signed-off-by: Huang Yiwei <hyiwei@codeaurora.org>
Change-Id: Ie0e6493ae7ad53b0cc57eb1bbcf8a0a11f652828
Export hrtimer_expire_entry/exit tracepoints, so that vendor modules
can register probes for these tracepoints.
Bug: 175936268
Change-Id: I739f369d3b56e09f8e9061fefdf25830e37e987e
Signed-off-by: Changki Kim <changki.kim@samsung.com>
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.
This patch restores the spirit of commit f227e3ec3b ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's statee using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.
The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same (even very slightly better) than before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output. An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.
It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack. Oops.
This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key. (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.) Speed is prioritized over security;
attacks are rare, while performance is always wanted.
Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.
Commit f227e3ec3b ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution. This patch replaces
it.
Reported-by: Amit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Fixes: f227e3ec3b ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin <lkml@sdf.org>
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
to prandom.h for later use; merged George's prandom_seed() proposal;
inlined siprand_u32(); replaced the net_rand_state[] array with 4
members to fix a build issue; cosmetic cleanups to make checkpatch
happy; fixed RANDOM32_SELFTEST build ]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Pull timekeeping updates from Thomas Gleixner:
"Updates for timekeeping, timers and related drivers:
Core:
- Early boot support for the NMI safe timekeeper by utilizing
local_clock() up to the point where timekeeping is initialized.
This allows printk() to store multiple timestamps in the ringbuffer
which is useful for coordinating dmesg information across a fleet
of machines.
- Provide a multi-timestamp accessor for printk()
- Make timer init more robust by checking for invalid timer flags.
- Comma vs semicolon fixes
Drivers:
- Support for new platforms in existing drivers (SP804 and Renesas
CMT)
- Comma vs semicolon fixes
* tag 'timers-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/armada-370-xp: Use semicolons rather than commas to separate statements
clocksource/drivers/mps2-timer: Use semicolons rather than commas to separate statements
timers: Mask invalid flags in do_init_timer()
clocksource/drivers/sp804: Enable Hisilicon sp804 timer 64bit mode
clocksource/drivers/sp804: Add support for Hisilicon sp804 timer
clocksource/drivers/sp804: Support non-standard register offset
clocksource/drivers/sp804: Prepare for support non-standard register offset
clocksource/drivers/sp804: Remove a mismatched comment
clocksource/drivers/sp804: Delete the leading "__" of some functions
clocksource/drivers/sp804: Remove unused sp804_timer_disable() and timer-sp804.h
clocksource/drivers/sp804: Cleanup clk_get_sys()
dt-bindings: timer: renesas,cmt: Document r8a774e1 CMT support
dt-bindings: timer: renesas,cmt: Document r8a7742 CMT support
alarmtimer: Convert comma to semicolon
timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper
timekeeping: Utilize local_clock() for NMI safe timekeeper during early boot
do_init_timer() accepts any combination of timer flags handed in by the
caller without a sanity check, but only TIMER_DEFFERABLE, TIMER_PINNED and
TIMER_IRQSAFE are valid.
If the supplied flags have other bits set, this could result in
malfunction. If bits are set in TIMER_CPUMASK the first timer usage could
deference a cpu base which is outside the range of possible CPUs. If
TIMER_MIGRATION is set, then the switch_timer_base() will live lock.
Prevent that with a sanity check which warns when invalid flags are
supplied and masks them out.
[ tglx: Made it WARN_ON_ONCE() and added context to the changelog ]
Signed-off-by: Qianli Zhao <zhaoqianli@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/9d79a8aa4eb56713af7379f99f062dedabcde140.1597326756.git.zhaoqianli@xiaomi.com
Running posix CPU timers in hard interrupt context has a few downsides:
- For PREEMPT_RT it cannot work as the expiry code needs to take
sighand lock, which is a 'sleeping spinlock' in RT. The original RT
approach of offloading the posix CPU timer handling into a high
priority thread was clumsy and provided no real benefit in general.
- For fine grained accounting it's just wrong to run this in context of
the timer interrupt because that way a process specific CPU time is
accounted to the timer interrupt.
- Long running timer interrupts caused by a large amount of expiring
timers which can be created and armed by unpriviledged user space.
There is no hard requirement to expire them in interrupt context.
If the signal is targeted at the task itself then it won't be delivered
before the task returns to user space anyway. If the signal is targeted at
a supervisor process then it might be slightly delayed, but posix CPU
timers are inaccurate anyway due to the fact that they are tied to the
tick.
Provide infrastructure to schedule task work which allows splitting the
posix CPU timer code into a quick check in interrupt context and a thread
context expiry and signal delivery function. This has to be enabled by
architectures as it requires that the architecture specific KVM
implementation handles pending task work before exiting to guest mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
Pull timer updates from Thomas Gleixner:
"Time, timers and related driver updates:
- Prevent unnecessary timer softirq invocations by extending the
tracking of the next expiring timer in the timer wheel beyond the
existing NOHZ functionality.
The tracking overhead at enqueue time is within the noise, but on
sensitive workloads the avoidance of the soft interrupt invocation
is a measurable improvement.
- The obligatory new clocksource driver for Ingenic X100 OST
- The usual fixes, improvements, cleanups and extensions for newer
chip variants all over the driver space"
* tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
timers: Recalculate next timer interrupt only when necessary
clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST.
dt-bindings: timer: Add Ingenic X1000 OST bindings.
clocksource/drivers: Replace HTTP links with HTTPS ones
clocksource/drivers/nomadik-mtu: Handle 32kHz clock
clocksource/drivers/sh_cmt: Use "kHz" for kilohertz
clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64
clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT.
timers: Lower base clock forwarding threshold
timers: Remove must_forward_clk
timers: Spare timer softirq until next expiry
timers: Expand clk forward logic beyond nohz
timers: Reuse next expiry cache after nohz exit
timers: Always keep track of next expiry
timers: Optimize _next_timer_interrupt() level iteration
timers: Add comments about calc_index() ceiling work
timers: Move trigger_dyntick_cpu() to enqueue_timer()
timers: Use only bucket expiry for base->next_expiry value
timers: Preserve higher bits of expiration on index calculation
clocksource/drivers/timer-atmel-tcb: Add sama5d2 support
...
This modifies the first 32 bits out of the 128 bits of a random CPU's
net_rand_state on interrupt or CPU activity to complicate remote
observations that could lead to guessing the network RNG's internal
state.
Note that depending on some network devices' interrupt rate moderation
or binding, this re-seeding might happen on every packet or even almost
never.
In addition, with NOHZ some CPUs might not even get timer interrupts,
leaving their local state rarely updated, while they are running
networked processes making use of the random state. For this reason, we
also perform this update in update_process_times() in order to at least
update the state when there is user or system activity, since it's the
only case we care about.
Reported-by: Amit Klein <aksecurity@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nohz tick code recalculates the timer wheel's next expiry on each idle
loop iteration.
On the other hand, the base next expiry is now always cached and updated
upon timer enqueue and execution. Only timer dequeue may leave
base->next_expiry out of date (but then its stale value won't ever go past
the actual next expiry to be recalculated).
Since recalculating the next_expiry isn't a free operation, especially when
the last wheel level is reached to find out that no timer has been enqueued
at all, reuse the next expiry cache when it is known to be reliable, which
it is most of the time.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200723151641.12236-1-frederic@kernel.org
Now that the core timer infrastructure doesn't depend anymore on
periodic base->clk increments, even when the CPU is not in NO_HZ mode,
timer softirqs can be skipped until there are timers to expire.
Some spurious softirqs can still remain since base->next_expiry doesn't
keep track of canceled timers but this still reduces the number of softirqs
significantly: ~15 times less for HZ=1000 and ~5 times less for HZ=100.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-11-frederic@kernel.org
As for next_expiry, the base->clk catch up logic will be expanded beyond
NOHZ in order to avoid triggering useless softirqs.
If softirqs should only fire to expire pending timers, periodic base->clk
increments must be skippable for random amounts of time. Therefore prepare
to catch-up with missing updates whenever an up-to-date base clock is
needed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-10-frederic@kernel.org
If a level has a timer that expires before reaching the next level, there
is no need to iterate further.
The next level is reached when the 3 lower bits of the current level are
cleared. If the next event happens before/during that, the next levels
won't provide an earlier expiration.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-7-frederic@kernel.org
The bucket expiry time is the effective expriy time of timers and is
greater than or equal to the requested timer expiry time. This is due
to the guarantee that timers never expire early and the reduced expiry
granularity in the secondary wheel levels.
When a timer is enqueued, trigger_dyntick_cpu() checks whether the
timer is the new first timer. This check compares next_expiry with
the requested timer expiry value and not with the effective expiry
value of the bucket into which the timer was queued.
Storing the requested timer expiry value in base->next_expiry can lead
to base->clk going backwards if the requested timer expiry value is
smaller than base->clk. Commit 30c66fc30e ("timer: Prevent base->clk
from moving backward") worked around this by preventing the store when
timer->expiry is before base->clk, but did not fix the underlying
problem.
Use the expiry value of the bucket into which the timer is queued to
do the new first timer check. This fixes the base->clk going backward
problem.
The workaround of commit 30c66fc30e ("timer: Prevent base->clk from
moving backward") in trigger_dyntick_cpu() is not longer necessary as the
timers bucket expiry is guaranteed to be greater than or equal base->clk.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-4-frederic@kernel.org
The higher bits of the timer expiration are cropped while calling
calc_index() due to the implicit cast from unsigned long to unsigned int.
This loss shouldn't have consequences on the current code since all the
computation to calculate the index is done on the lower 32 bits.
However to prepare for returning the actual bucket expiration from
calc_index() in order to properly fix base->next_expiry updates, the higher
bits need to be preserved.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-3-frederic@kernel.org
When an expiration delta falls into the last level of the wheel, that delta
has be compared against the maximum possible delay and reduced to fit in if
necessary.
However instead of comparing the delta against the maximum, the code
compares the actual expiry against the maximum. Then instead of fixing the
delta to fit in, it sets the maximum delta as the expiry value.
This can result in various undesired outcomes, the worst possible one
being a timer expiring 15 days ahead to fire immediately.
Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200717140551.29076-2-frederic@kernel.org
When a timer is enqueued with a negative delta (ie: expiry is below
base->clk), it gets added to the wheel as expiring now (base->clk).
Yet the value that gets stored in base->next_expiry, while calling
trigger_dyntick_cpu(), is the initial timer->expires value. The
resulting state becomes:
base->next_expiry < base->clk
On the next timer enqueue, forward_timer_base() may accidentally
rewind base->clk. As a possible outcome, timers may expire way too
early, the worst case being that the highest wheel levels get spuriously
processed again.
To prevent from that, make sure that base->next_expiry doesn't get below
base->clk.
Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200703010657.2302-1-frederic@kernel.org
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull timekeeping and timer updates from Thomas Gleixner:
"Core:
- Consolidation of the vDSO build infrastructure to address the
difficulties of cross-builds for ARM64 compat vDSO libraries by
restricting the exposure of header content to the vDSO build.
This is achieved by splitting out header content into separate
headers. which contain only the minimaly required information which
is necessary to build the vDSO. These new headers are included from
the kernel headers and the vDSO specific files.
- Enhancements to the generic vDSO library allowing more fine grained
control over the compiled in code, further reducing architecture
specific storage and preparing for adopting the generic library by
PPC.
- Cleanup and consolidation of the exit related code in posix CPU
timers.
- Small cleanups and enhancements here and there
Drivers:
- The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
- Correct the clock rate of PIT64b global clock
- setup_irq() cleanup
- Preparation for PWM and suspend support for the TI DM timer
- Expand the fttmr010 driver to support ast2600 systems
- The usual small fixes, enhancements and cleanups all over the
place"
* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
vdso: Fix clocksource.h macro detection
um: Fix header inclusion
arm64: vdso32: Enable Clang Compilation
lib/vdso: Enable common headers
arm: vdso: Enable arm to use common headers
x86/vdso: Enable x86 to use common headers
mips: vdso: Enable mips to use common headers
arm64: vdso32: Include common headers in the vdso library
arm64: vdso: Include common headers in the vdso library
arm64: Introduce asm/vdso/processor.h
arm64: vdso32: Code clean up
linux/elfnote.h: Replace elf.h with UAPI equivalent
scripts: Fix the inclusion order in modpost
common: Introduce processor.h
linux/ktime.h: Extract common header for vDSO
linux/jiffies.h: Extract common header for vDSO
linux/time64.h: Extract common header for vDSO
linux/time32.h: Extract common header for vDSO
linux/time.h: Extract common header for vDSO
...
The timer_pending() function is mostly used in lockless contexts, so
Without proper annotations, KCSAN might detect a data-race [1].
Using hlist_unhashed_lockless() instead of hand-coding it seems
appropriate (as suggested by Paul E. McKenney).
[1]
BUG: KCSAN: data-race in del_timer / detach_if_pending
write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0:
__hlist_del include/linux/list.h:764 [inline]
detach_timer kernel/time/timer.c:815 [inline]
detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832
try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226
del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365
schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896
rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639
rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1:
del_timer+0x3b/0xb0 kernel/time/timer.c:1198
sk_stop_timer+0x25/0x60 net/core/sock.c:2845
inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523
tcp_clear_xmit_timers include/net/tcp.h:606 [inline]
tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096
inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836
tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497
inet_release+0x86/0x100 net/ipv4/af_inet.c:427
__sock_release+0x85/0x160 net/socket.c:590
sock_close+0x24/0x30 net/socket.c:1268
__fput+0x1e1/0x520 fs/file_table.c:280
____fput+0x1f/0x30 fs/file_table.c:313
task_work_run+0xf6/0x130 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Pulled in Eric's later amendments. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When working commit 6dcd5d7a7a, a mistake was noticed by Linus:
schedule_timeout() was called without setting the task state to anything
particular.
It calls the scheduler, but doesn't delay anything, because the task stays
runnable. That happens because sched_submit_work() does nothing for tasks
in TASK_RUNNING state.
That turned out to be the intended behavior. Adding a WARN() is not useful
as the task could be woken up right after setting the state and before
reaching schedule_timeout().
Improve the comment about schedule_timeout() and describe that more
explicitly.
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200117225900.16340-1-alex.popov@linux.com
The timer delayed for more than 3 seconds warning was triggered during
testing.
Workqueue: events_unbound sched_tick_remote
RIP: 0010:sched_tick_remote+0xee/0x100
...
Call Trace:
process_one_work+0x18c/0x3a0
worker_thread+0x30/0x380
kthread+0x113/0x130
ret_from_fork+0x22/0x40
The reason is that the code in collect_expired_timers() uses jiffies
unprotected:
if (next_event > jiffies)
base->clk = jiffies;
As the compiler is allowed to reload the value base->clk can advance
between the check and the store and in the worst case advance farther than
next event. That causes the timer expiry to be delayed until the wheel
pointer wraps around.
Convert the code to use READ_ONCE()
Fixes: 236968383c ("timers: Optimize collect_expired_timers() for NOHZ")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1568894687-14499-1-git-send-email-lirongqing@baidu.com
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted. If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:
- If the caller is on a remote CPU then it has to spin wait for the timer
handler to complete. This can result in unbound priority inversion.
- If the caller originates from the task which preempted the timer
handler on the same CPU, then spin waiting for the timer handler to
complete is never going to end.
To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.
This addresses both the priority inversion and the life lock issues.
This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.
The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.
As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.
[ tglx: Refactored it for mainline ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
Pull printk updates from Petr Mladek:
- Allow state reset of printk_once() calls.
- Prevent crashes when dereferencing invalid pointers in vsprintf().
Only the first byte is checked for simplicity.
- Make vsprintf warnings consistent and inlined.
- Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
modifiers.
- Some clean up of vsprintf and test_printf code.
* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
lib/vsprintf: Make function pointer_string static
vsprintf: Limit the length of inlined error messages
vsprintf: Avoid confusion between invalid address and value
vsprintf: Prevent crash when dereferencing invalid pointers
vsprintf: Consolidate handling of unknown pointer specifiers
vsprintf: Factor out %pO handler as kobject_string()
vsprintf: Factor out %pV handler as va_format()
vsprintf: Factor out %p[iI] handler as ip_addr_string()
vsprintf: Do not check address of well-known strings
vsprintf: Consistent %pK handling for kptr_restrict == 0
vsprintf: Shuffle restricted_pointer()
printk: Tie printk_once / printk_deferred_once into .data.once for reset
treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
lib/test_printf: Switch to bitmap_zalloc()
Timers are added to the timer wheel off by one. This is required in
case a timer is queued directly before incrementing jiffies to prevent
early timer expiry.
When reading a timer trace and relying only on the expiry time of the timer
in the timer_start trace point and on the now in the timer_expiry_entry
trace point, it seems that the timer fires late. With the current
timer_expiry_entry trace point information only now=jiffies is printed but
not the value of base->clk. This makes it impossible to draw a conclusion
to the index of base->clk and makes it impossible to examine timer problems
without additional trace points.
Therefore add the base->clk value to the timer_expire_entry trace
point, to be able to calculate the index the timer base is located at
during collecting expired timers.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-5-anna-maria@linutronix.de
When placing the timer_start trace point before the timer wheel bucket
index is calculated, the index information in the trace point is useless.
It is not possible to simply move the debug_activate() call after the index
calculation, because debug_object_activate() needs to be called before
touching the object.
Therefore split debug_activate() and move the trace point into
enqueue_timer() after the new index has been calculated. The
debug_object_activate() call remains at the original place.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-3-anna-maria@linutronix.de
The name rcu_check_callbacks() arguably made sense back in the early
2000s when RCU was quite a bit simpler than it is today, but it has
become quite misleading, especially with the advent of dyntick-idle
and NO_HZ_FULL. The rcu_check_callbacks() function is RCU's hook into
the scheduling-clock interrupt, and is now but one of many ways that
callbacks get promoted to invocable state.
This commit therefore changes the name to rcu_sched_clock_irq(),
which is the same number of characters and clearly indicates this
function's relation to the rest of the Linux kernel. In addition, for
the sake of consistency, rcu_flavor_check_callbacks() is also renamed
to rcu_flavor_sched_clock_irq().
While in the area, the header comments for both functions are reworked.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
timer_base::must_forward_clock is indicating that the base clock might be
stale due to a long idle sleep.
The forwarding of the base clock takes place in the timer softirq or when a
timer is enqueued to a base which is idle. If the enqueue of timer to an
idle base happens from a remote CPU, then the following race can happen:
CPU0 CPU1
run_timer_softirq mod_timer
base = lock_timer_base(timer);
base->must_forward_clk = false
if (base->must_forward_clk)
forward(base); -> skipped
enqueue_timer(base, timer, idx);
-> idx is calculated high due to
stale base
unlock_timer_base(timer);
base = lock_timer_base(timer);
forward(base);
The root cause is that timer_base::must_forward_clk is cleared outside the
timer_base::lock held region, so the remote queuing CPU observes it as
cleared, but the base clock is still stale. This can cause large
granularity values for timers, i.e. the accuracy of the expiry time
suffers.
Prevent this by clearing the flag with timer_base::lock held, so that the
forwarding takes place before the cleared flag is observable by a remote
CPU.
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org