Changes in 5.10.65
locking/mutex: Fix HANDOFF condition
regmap: fix the offset of register error log
regulator: tps65910: Silence deferred probe error
crypto: mxs-dcp - Check for DMA mapping errors
sched/deadline: Fix reset_on_fork reporting of DL tasks
power: supply: axp288_fuel_gauge: Report register-address on readb / writeb errors
crypto: omap-sham - clear dma flags only after omap_sham_update_dma_stop()
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
rcu/tree: Handle VM stoppage in stall detection
EDAC/mce_amd: Do not load edac_mce_amd module on guests
posix-cpu-timers: Force next expiration recalc after itimer reset
hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
hrtimer: Ensure timerfd notification for HIGHRES=n
udf: Check LVID earlier
udf: Fix iocharset=utf8 mount option
isofs: joliet: Fix iocharset=utf8 mount option
bcache: add proper error unwinding in bcache_device_init
blk-throtl: optimize IOPS throttle for large IO scenarios
nvme-tcp: don't update queue count when failing to set io queues
nvme-rdma: don't update queue count when failing to set io queues
nvmet: pass back cntlid on successful completion
power: supply: smb347-charger: Add missing pin control activation
power: supply: max17042_battery: fix typo in MAx17042_TOFF
s390/cio: add dev_busid sysfs entry for each subchannel
s390/zcrypt: fix wrong offset index for APKA master key valid state
libata: fix ata_host_start()
crypto: omap - Fix inconsistent locking of device lists
crypto: qat - do not ignore errors from enable_vf2pf_comms()
crypto: qat - handle both source of interrupt in VF ISR
crypto: qat - fix reuse of completion variable
crypto: qat - fix naming for init/shutdown VF to PF notifications
crypto: qat - do not export adf_iov_putmsg()
fcntl: fix potential deadlock for &fasync_struct.fa_lock
udf_get_extendedattr() had no boundary checks.
s390/kasan: fix large PMD pages address alignment check
s390/pci: fix misleading rc in clp_set_pci_fn()
s390/debug: keep debug data on resize
s390/debug: fix debug area life cycle
s390/ap: fix state machine hang after failure to enable irq
power: supply: cw2015: use dev_err_probe to allow deferred probe
m68k: emu: Fix invalid free in nfeth_cleanup()
sched/numa: Fix is_core_idle()
sched: Fix UCLAMP_FLAG_IDLE setting
rcu: Fix to include first blocked task in stall warning
rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees
rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
m68k: Fix invalid RMW_INSNS on CPUs that lack CAS
block: return ELEVATOR_DISCARD_MERGE if possible
spi: spi-fsl-dspi: Fix issue with uninitialized dma_slave_config
spi: spi-pic32: Fix issue with uninitialized dma_slave_config
genirq/timings: Fix error return code in irq_timings_test_irqs()
irqchip/loongson-pch-pic: Improve edge triggered interrupt support
lib/mpi: use kcalloc in mpi_resize
clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
block: nbd: add sanity check for first_minor
spi: coldfire-qspi: Use clk_disable_unprepare in the remove function
irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
crypto: qat - use proper type for vf_mask
certs: Trigger creation of RSA module signing key if it's not an RSA key
tpm: ibmvtpm: Avoid error message when process gets signal while waiting
x86/mce: Defer processing of early errors
spi: davinci: invoke chipselect callback
blk-crypto: fix check for too-large dun_bytes
regulator: vctrl: Use locked regulator_get_voltage in probe path
regulator: vctrl: Avoid lockdep warning in enable/disable ops
spi: sprd: Fix the wrong WDG_LOAD_VAL
spi: spi-zynq-qspi: use wait_for_completion_timeout to make zynq_qspi_exec_mem_op not interruptible
EDAC/i10nm: Fix NVDIMM detection
drm/panfrost: Fix missing clk_disable_unprepare() on error in panfrost_clk_init()
drm/gma500: Fix end of loop tests for list_for_each_entry
ASoC: mediatek: mt8183: Fix Unbalanced pm_runtime_enable in mt8183_afe_pcm_dev_probe
media: TDA1997x: enable EDID support
leds: is31fl32xx: Fix missing error code in is31fl32xx_parse_dt()
soc: rockchip: ROCKCHIP_GRF should not default to y, unconditionally
media: cxd2880-spi: Fix an error handling path
drm/of: free the right object
bpf: Fix a typo of reuseport map in bpf.h.
bpf: Fix potential memleak and UAF in the verifier.
drm/of: free the iterator object on failure
gve: fix the wrong AdminQ buffer overflow check
libbpf: Fix the possible memory leak on error
ARM: dts: aspeed-g6: Fix HVI3C function-group in pinctrl dtsi
arm64: dts: renesas: r8a77995: draak: Remove bogus adv7511w properties
i40e: improve locking of mac_filter_hash
soc: qcom: rpmhpd: Use corner in power_off
libbpf: Fix removal of inner map in bpf_object__create_map
gfs2: Fix memory leak of object lsi on error return path
firmware: fix theoretical UAF race with firmware cache and resume
driver core: Fix error return code in really_probe()
ionic: cleanly release devlink instance
media: dvb-usb: fix uninit-value in dvb_usb_adapter_dvb_init
media: dvb-usb: fix uninit-value in vp702x_read_mac_addr
media: dvb-usb: Fix error handling in dvb_usb_i2c_init
media: go7007: fix memory leak in go7007_usb_probe
media: go7007: remove redundant initialization
media: rockchip/rga: use pm_runtime_resume_and_get()
media: rockchip/rga: fix error handling in probe
media: coda: fix frame_mem_ctrl for YUV420 and YVU420 formats
media: atomisp: fix the uninitialized use and rename "retvalue"
Bluetooth: sco: prevent information leak in sco_conn_defer_accept()
6lowpan: iphc: Fix an off-by-one check of array index
drm/amdgpu/acp: Make PM domain really work
tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
ARM: dts: meson8: Use a higher default GPU clock frequency
ARM: dts: meson8b: odroidc1: Fix the pwm regulator supply properties
ARM: dts: meson8b: mxq: Fix the pwm regulator supply properties
ARM: dts: meson8b: ec100: Fix the pwm regulator supply properties
net/mlx5e: Prohibit inner indir TIRs in IPoIB
net/mlx5e: Block LRO if firmware asks for tunneled LRO
cgroup/cpuset: Fix a partition bug with hotplug
drm: mxsfb: Enable recovery on underflow
drm: mxsfb: Increase number of outstanding requests on V4 and newer HW
drm: mxsfb: Clear FIFO_CLEAR bit
net: cipso: fix warnings in netlbl_cipsov4_add_std
Bluetooth: mgmt: Fix wrong opcode in the response for add_adv cmd
arm64: dts: renesas: rzg2: Convert EtherAVB to explicit delay handling
arm64: dts: renesas: hihope-rzg2-ex: Add EtherAVB internal rx delay
devlink: Break parameter notification sequence to be before/after unload/load driver
net/mlx5: Fix missing return value in mlx5_devlink_eswitch_inline_mode_set()
i2c: highlander: add IRQ check
leds: lt3593: Put fwnode in any case during ->probe()
leds: trigger: audio: Add an activate callback to ensure the initial brightness is set
media: em28xx-input: fix refcount bug in em28xx_usb_disconnect
media: venus: venc: Fix potential null pointer dereference on pointer fmt
PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently
PCI: PM: Enable PME if it can be signaled from D3cold
bpf, samples: Add missing mprog-disable to xdp_redirect_cpu's optstring
soc: qcom: smsm: Fix missed interrupts if state changes while masked
debugfs: Return error during {full/open}_proxy_open() on rmmod
Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow
PM: EM: Increase energy calculation precision
selftests/bpf: Fix bpf-iter-tcp4 test to print correctly the dest IP
drm/msm/mdp4: refactor HW revision detection into read_mdp_hw_revision
drm/msm/mdp4: move HW revision detection to earlier phase
drm/msm/dpu: make dpu_hw_ctl_clear_all_blendstages clear necessary LMs
arm64: dts: exynos: correct GIC CPU interfaces address range on Exynos7
counter: 104-quad-8: Return error when invalid mode during ceiling_write
cgroup/cpuset: Miscellaneous code cleanup
cgroup/cpuset: Fix violation of cpuset locking rule
ASoC: Intel: Fix platform ID matching
Bluetooth: fix repeated calls to sco_sock_kill
drm/msm/dsi: Fix some reference counted resource leaks
net/mlx5: Register to devlink ingress VLAN filter trap
net/mlx5: Fix unpublish devlink parameters
ASoC: rt5682: Implement remove callback
ASoC: rt5682: Properly turn off regulators if wrong device ID
usb: dwc3: meson-g12a: add IRQ check
usb: dwc3: qcom: add IRQ check
usb: gadget: udc: at91: add IRQ check
usb: gadget: udc: s3c2410: add IRQ check
usb: phy: fsl-usb: add IRQ check
usb: phy: twl6030: add IRQ checks
usb: gadget: udc: renesas_usb3: Fix soc_device_match() abuse
selftests/bpf: Fix test_core_autosize on big-endian machines
devlink: Clear whole devlink_flash_notify struct
samples: pktgen: add missing IPv6 option to pktgen scripts
Bluetooth: Move shutdown callback before flushing tx and rx queue
PM: cpu: Make notifier chain use a raw_spinlock_t
usb: host: ohci-tmio: add IRQ check
usb: phy: tahvo: add IRQ check
libbpf: Re-build libbpf.so when libbpf.map changes
mac80211: Fix insufficient headroom issue for AMSDU
locking/lockdep: Mark local_lock_t
locking/local_lock: Add missing owner initialization
lockd: Fix invalid lockowner cast after vfs_test_lock
nfsd4: Fix forced-expiry locking
arm64: dts: marvell: armada-37xx: Extend PCIe MEM space
clk: staging: correct reference to config IOMEM to config HAS_IOMEM
i2c: synquacer: fix deferred probing
firmware: raspberrypi: Keep count of all consumers
firmware: raspberrypi: Fix a leak in 'rpi_firmware_get()'
usb: gadget: mv_u3d: request_irq() after initializing UDC
mm/swap: consider max pages in iomap_swapfile_add_extent
lkdtm: replace SCSI_DISPATCH_CMD with SCSI_QUEUE_RQ
Bluetooth: add timeout sanity check to hci_inquiry
i2c: iop3xx: fix deferred probing
i2c: s3c2410: fix IRQ check
i2c: fix platform_get_irq.cocci warnings
i2c: hix5hd2: fix IRQ check
gfs2: init system threads before freeze lock
rsi: fix error code in rsi_load_9116_firmware()
rsi: fix an error code in rsi_probe()
ASoC: Intel: kbl_da7219_max98927: Fix format selection for max98373
ASoC: Intel: Skylake: Leave data as is when invoking TLV IPCs
ASoC: Intel: Skylake: Fix module resource and format selection
mmc: sdhci: Fix issue with uninitialized dma_slave_config
mmc: dw_mmc: Fix issue with uninitialized dma_slave_config
mmc: moxart: Fix issue with uninitialized dma_slave_config
bpf: Fix possible out of bound write in narrow load handling
CIFS: Fix a potencially linear read overflow
i2c: mt65xx: fix IRQ check
i2c: xlp9xx: fix main IRQ check
usb: ehci-orion: Handle errors of clk_prepare_enable() in probe
usb: bdc: Fix an error handling path in 'bdc_probe()' when no suitable DMA config is available
usb: bdc: Fix a resource leak in the error handling path of 'bdc_probe()'
tty: serial: fsl_lpuart: fix the wrong mapbase value
ASoC: wcd9335: Fix a double irq free in the remove function
ASoC: wcd9335: Fix a memory leak in the error handling path of the probe function
ASoC: wcd9335: Disable irq on slave ports in the remove function
iwlwifi: follow the new inclusive terminology
iwlwifi: skip first element in the WTAS ACPI table
ice: Only lock to update netdev dev_addr
ath6kl: wmi: fix an error code in ath6kl_wmi_sync_point()
atlantic: Fix driver resume flow.
bcma: Fix memory leak for internally-handled cores
brcmfmac: pcie: fix oops on failure to resume and reprobe
ipv6: make exception cache less predictible
ipv4: make exception cache less predictible
net: sched: Fix qdisc_rate_table refcount leak when get tcf_block failed
net: qualcomm: fix QCA7000 checksum handling
octeontx2-af: Fix loop in free and unmap counter
octeontx2-af: Fix static code analyzer reported issues
octeontx2-af: Set proper errorcode for IPv4 checksum errors
ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
ASoC: rt5682: Remove unused variable in rt5682_i2c_remove()
iwlwifi Add support for ax201 in Samsung Galaxy Book Flex2 Alpha
f2fs: guarantee to write dirty data when enabling checkpoint back
time: Handle negative seconds correctly in timespec64_to_ns()
io_uring: IORING_OP_WRITE needs hash_reg_file set
bio: fix page leak bio_add_hw_page failure
tty: Fix data race between tiocsti() and flush_to_ldisc()
perf/x86/amd/ibs: Extend PERF_PMU_CAP_NO_EXCLUDE to IBS Op
x86/resctrl: Fix a maybe-uninitialized build warning treated as error
Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
KVM: s390: index kvm->arch.idle_mask by vcpu_idx
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter
ARM: dts: at91: add pinctrl-{names, 0} for all gpios
fuse: truncate pagecache on atomic_o_trunc
fuse: flush extending writes
IMA: remove -Wmissing-prototypes warning
IMA: remove the dependency on CRYPTO_MD5
fbmem: don't allow too huge resolutions
backlight: pwm_bl: Improve bootloader/kernel device handover
clk: kirkwood: Fix a clocking boot regression
Linux 5.10.65
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie0b9306ba6ee4193de3200df7cdacaeba152b83e
[ Upstream commit 8c3b5e6ec0fee18bc2ce38d1dfe913413205f908 ]
If high resolution timers are disabled the timerfd notification about a
clock was set event is not happening for all cases which use
clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong.
Make clock_was_set_delayed() unconditially available to fix that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
Try to mitigate potential future driver core api changes by adding a
padding to struct hrtimer.
Based on a change made to the RHEL/CENTOS 8 kernel.
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5432e05386265281d993199599c6f9dcd17a9daf
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_raw_spinlock_t data type, which allows to associate
a raw spinlock with the sequence counter. This enables lockdep to verify
that the raw spinlock used for writer serialization is held when the
write side critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-25-a.darwish@linutronix.de
clock_nanosleep() accepts absolute values of expiration time when
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace, and has to be converted to the host's time.
There is timens_ktime_to_host() helper for converting time, but
it accepts ktime argument.
As a preparation, make hrtimer_nanosleep() accept a clock value in ktime
instead of timespec64.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-17-dima@arista.com
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted. If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling hrtimer_cancel() can lead to two issues:
- If the caller is on a remote CPU then it has to spin wait for the timer
handler to complete. This can result in unbound priority inversion.
- If the caller originates from the task which preempted the timer
handler on the same CPU, then spin waiting for the timer handler to
complete is never going to end.
To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If hrtimer_cancel() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.
This addresses both the priority inversion and the life lock issues.
The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
hrtimer_cancel() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.
[ tglx: Refactored it for mainline ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.737767218@linutronix.de
hrtimer_start_range_ns() has a WARN_ONCE() which verifies that a timer
which is marker for softirq expiry is not queued in the hard interrupt base
and vice versa.
When PREEMPT_RT is enabled, timers which are not explicitely marked to
expire in hard interrupt context are deferrred to the soft interrupt. So
the regular check would trigger.
Change the check, so when PREEMPT_RT is enabled, it is verified that the
timers marked for hard interrupt expiry are not tried to be queued for soft
interrupt expiry or any of the unmarked and softirq marked is tried to be
expired in hard interrupt context.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On PREEMPT_RT not all hrtimers can be expired in hard interrupt context
even if that is perfectly fine on a PREEMPT_RT=n kernel, e.g. because they
take regular spinlocks. Also for latency reasons PREEMPT_RT tries to defer
most hrtimers' expiry into soft interrupt context.
But there are hrtimers which must be expired in hard interrupt context even
when PREEMPT_RT is enabled:
- hrtimers which must expiry in hard interrupt context, e.g. scheduler,
perf, watchdog related hrtimers
- latency critical hrtimers, e.g. nanosleep, ..., kvm lapic timer
Add a new mode flag HRTIMER_MODE_HARD which allows to mark these timers so
PREEMPT_RT will not move them into softirq expiry mode.
[ tglx: Split out of a larger combo patch. Added changelog ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.981398465@linutronix.de
hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Create a wrapper around hrtimer_start_expires() to make that
possible.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
object which is embedded into the hrtimer_sleeper.
Combine the initialization and spare a function call. Fixup all call sites.
This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
specific initializations of the embedded hrtimer without modifying any of
the call sites.
No functional change.
[ anna-maria: Minor cleanups ]
[ tglx: Adopted to the removal of the task argument of
hrtimer_init_sleeper() and trivial polishing.
Folded a fix from Stephen Rothwell for the vsoc code ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de
Revert commits
92af4dcb4e ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f43 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a4047a ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e91 ("timekeeping: Remove boot time specific code")
f2d6fdbfd2 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449afd ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320d4 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")
As stated in the pull request for the unification of CLOCK_MONOTONIC and
CLOCK_BOOTTIME, it was clear that we might have to revert the change.
As reported by several folks systemd and other applications rely on the
documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
changes. After resume daemons time out and other timeout related issues are
observed. Rafael compiled this list:
* systemd kills daemons on resume, after >WatchdogSec seconds
of suspending (Genki Sky). [Verified that that's because systemd uses
CLOCK_MONOTONIC and expects it to not include the suspend time.]
* systemd-journald misbehaves after resume:
systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
corrupted or uncleanly shut down, renaming and replacing.
(Mike Galbraith).
* NetworkManager reports "networking disabled" and networking is broken
after resume 50% of the time (Pavel). [May be because of systemd.]
* MATE desktop dims the display and starts the screensaver right after
system resume (Pavel).
* Full system hang during resume (me). [May be due to systemd or NM or both.]
That happens on debian and open suse systems.
It's sad, that these problems were neither catched in -next nor by those
folks who expressed interest in this change.
Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>,
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
* pm-cpuidle:
tick-sched: avoid a maybe-uninitialized warning
cpuidle: Add definition of residency to sysfs documentation
time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
nohz: Avoid duplication of code related to got_idle_tick
nohz: Gather tick_sched booleans under a common flag field
cpuidle: menu: Avoid selecting shallow states with stopped tick
cpuidle: menu: Refine idle state selection for running tick
sched: idle: Select idle state before stopping the tick
time: hrtimer: Introduce hrtimer_next_event_without()
time: tick-sched: Split tick_nohz_stop_sched_tick()
cpuidle: Return nohz hint from cpuidle_select()
jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
sched: idle: Do not stop the tick before cpuidle_idle_call()
sched: idle: Do not stop the tick upfront in the idle loop
time: tick-sched: Reorganize idle tick management code
* pm-qos:
PM / QoS: mark expected switch fall-throughs
The next set of changes will need to compute the time to the next
hrtimer event over all hrtimers except for the scheduler tick one.
To that end introduce a new helper function,
hrtimer_next_event_without(), for computing the time until the next
hrtimer event over all timers except for one and modify the underlying
code in __hrtimer_next_event_base() to prepare it for being called by
that new function.
No intentional changes in functionality.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
hrtimer callbacks are always invoked in hard interrupt context. Several
users in tree require soft interrupt context for their callbacks and
achieve this by combining a hrtimer with a tasklet. The hrtimer schedules
the tasklet in hard interrupt context and the tasklet callback gets invoked
in softirq context later.
That's suboptimal and aside of that the real-time patch moves most of the
hrtimers into softirq context. So adding native support for hrtimers
expiring in softirq context is a valuable extension for both mainline and
the RT patch set.
Each valid hrtimer clock id has two associated hrtimer clock bases: one for
timers expiring in hardirq context and one for timers expiring in softirq
context.
Implement the functionality to associate a hrtimer with the hard or softirq
related clock bases and update the relevant functions to take them into
account when the next expiry time needs to be evaluated.
Add a check into the hard interrupt context handler functions to check
whether the first expiring softirq based timer has expired. If it's expired
the softirq is raised and the accounting of softirq based timers to
evaluate the next expiry time for programming the timer hardware is skipped
until the softirq processing has finished. At the end of the softirq
processing the regular processing is resumed.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-29-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently hrtimer callback functions are always executed in hard interrupt
context. Users of hrtimers, which need their timer function to be executed
in soft interrupt context, make use of tasklets to get the proper context.
Add additional hrtimer clock bases for timers which must expire in softirq
context, so the detour via the tasklet can be avoided. This is also
required for RT, where the majority of hrtimer is moved into softirq
hrtimer context.
The selection of the expiry mode happens via a mode bit. Introduce
HRTIMER_MODE_SOFT and the matching combinations with the ABS/REL/PINNED
bits and update the decoding of hrtimer_mode in tracepoints.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-27-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_cpu_base.next_timer stores the pointer to the next expiring timer
in a CPU base.
This pointer cannot be dereferenced and is solely used to check whether a
hrtimer which is removed is the hrtimer which is the first to expire in the
CPU base. If this is the case, then the timer hardware needs to be
reprogrammed to avoid an extra interrupt for nothing.
Again, this is conditional functionality, but there is no compelling reason
to make this conditional. As a preparation, hrtimer_cpu_base.next_timer
needs to be available unconditonally.
Aside of that the upcoming support for softirq based hrtimers requires access
to this pointer unconditionally as well, so our motivation is not entirely
simplicity based.
Make the update of hrtimer_cpu_base.next_timer unconditional and remove the
#ifdef cruft. The impact on CONFIG_HIGH_RES_TIMERS=n && CONFIG_NOHZ=n is
marginal as it's just a store on an already dirtied cacheline.
No functional change.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-17-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
hrtimer_cpu_base.expires_next is used to cache the next event armed in the
timer hardware. The value is used to check whether an hrtimer can be
enqueued remotely. If the new hrtimer is expiring before expires_next, then
remote enqueue is not possible as the remote hrtimer hardware cannot be
accessed for reprogramming to an earlier expiry time.
The remote enqueue check is currently conditional on
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active. There is no
compelling reason to make this conditional.
Move hrtimer_cpu_base.expires_next out of the CONFIG_HIGH_RES_TIMERS=y
guarded area and remove the conditionals in hrtimer_check_target().
The check is currently a NOOP for the CONFIG_HIGH_RES_TIMERS=n and the
!hrtimer_cpu_base.hres_active case because in these cases nothing updates
hrtimer_cpu_base.expires_next yet. This will be changed with later patches
which further reduce the #ifdef zoo in this code.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-16-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The pointer to the currently running timer is stored in hrtimer_cpu_base
before the base lock is dropped and the callback is invoked.
This results in two levels of indirections and the upcoming support for
softirq based hrtimer requires splitting the "running" storage into soft
and hard IRQ context expiry.
Storing both in the cpu base would require conditionals in all code paths
accessing that information.
It's possible to have a per clock base sequence count and running pointer
without changing the semantics of the related mechanisms because the timer
base pointer cannot be changed while a timer is running the callback.
Unfortunately this makes cpu_clock base larger than 32 bytes on 32-bit
kernels. Instead of having huge gaps due to alignment, remove the alignment
and let the compiler pack CPU base for 32-bit kernels. The resulting cache access
patterns are fortunately not really different from the current
behaviour. On 64-bit kernels the 64-byte alignment stays and the behaviour is
unchanged. This was determined by analyzing the resulting layout and
looking at the number of cache lines involved for the frequently used
clocks.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-12-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
schedule_hrtimeout_range_clock() uses an 'int clock' parameter for the
clock ID, instead of the customary predefined "clockid_t" type.
In hrtimer coding style the canonical variable name for the clock ID is
'clock_id', therefore change the name of the parameter here as well
to make it all consistent.
While at it, clean up the description for the 'clock_id' and 'mode'
function parameters. The clock modes and the clock IDs are not
restricted as the comment suggests.
Fix the mode description as well for the callers of schedule_hrtimeout_range_clock().
No functional changes intended.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-5-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Usage of these apis and their compat versions makes
the syscalls: clock_nanosleep and nanosleep and
their compat implementations simpler.
This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In our quest to simplify <linux/sched.h>'s header dependencies, remove
the <linux/wait.h> inclusion from <linux/hrtimer.h> - which does
not appear to be necessary, as hrtimer.h does not use waitqueues.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).
The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.
The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.
With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:
$ time sleep 1
real 0m10.747s
user 0m0.001s
sys 0m0.005s
The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.
Other than that things are fairly straightforward.
This patch (of 2):
The timer_slack_ns value in the task struct is currently a unsigned
long. This means that on 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, its much much larger (~500
years).
This disparity could make application development a little (as well as
the default_slack) to a u64. This means both 32bit and 64bit systems
have the same effective internal slack range.
Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.
This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If CONFIG_TIME_LOW_RES is enabled we add a jiffie to the relative timeout to
prevent short sleeps, but we do not account for that in interfaces which
retrieve the remaining time.
Helge observed that timerfd can return a remaining time larger than the
relative timeout. That's not expected and breaks userland test programs.
Store the information that the timer was armed relative and provide functions
to adjust the remaining time. To avoid bloating the hrtimer struct make state
a u8, which as a bonus results in better code on x86 at least.
Reported-and-tested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: dhowells@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20160114164159.273328486@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If nohz is disabled on the kernel command line the [hr]timer code
still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
avoid the overhead.
Before:
48.10% hog [.] main
15.25% [kernel] [k] _raw_spin_lock_irqsave
9.76% [kernel] [k] _raw_spin_unlock_irqrestore
6.50% [kernel] [k] mod_timer
6.44% [kernel] [k] lock_timer_base.isra.38
3.87% [kernel] [k] detach_if_pending
3.80% [kernel] [k] del_timer
2.67% [kernel] [k] internal_add_timer
1.33% [kernel] [k] __internal_add_timer
0.73% [kernel] [k] timerfn
0.54% [kernel] [k] wake_up_nohz_cpu
After:
48.73% hog [.] main
15.36% [kernel] [k] _raw_spin_lock_irqsave
9.77% [kernel] [k] _raw_spin_unlock_irqrestore
6.61% [kernel] [k] lock_timer_base.isra.38
6.42% [kernel] [k] mod_timer
3.90% [kernel] [k] detach_if_pending
3.76% [kernel] [k] del_timer
2.41% [kernel] [k] internal_add_timer
1.39% [kernel] [k] __internal_add_timer
0.76% [kernel] [k] timerfn
We probably should have a cached value for nohz full in the per cpu
bases as well to avoid the cpumask check. The base cache line is hot
already, the cpumask not necessarily.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.
We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.
The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.
With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.
Before:
47.76% hog [.] main
14.84% [kernel] [k] _raw_spin_lock_irqsave
9.55% [kernel] [k] _raw_spin_unlock_irqrestore
6.71% [kernel] [k] mod_timer
6.24% [kernel] [k] lock_timer_base.isra.38
3.76% [kernel] [k] detach_if_pending
3.71% [kernel] [k] del_timer
2.50% [kernel] [k] internal_add_timer
1.51% [kernel] [k] get_nohz_timer_target
1.28% [kernel] [k] __internal_add_timer
0.78% [kernel] [k] timerfn
0.48% [kernel] [k] wake_up_nohz_cpu
After:
48.10% hog [.] main
15.25% [kernel] [k] _raw_spin_lock_irqsave
9.76% [kernel] [k] _raw_spin_unlock_irqrestore
6.50% [kernel] [k] mod_timer
6.44% [kernel] [k] lock_timer_base.isra.38
3.87% [kernel] [k] detach_if_pending
3.80% [kernel] [k] del_timer
2.67% [kernel] [k] internal_add_timer
1.33% [kernel] [k] __internal_add_timer
0.73% [kernel] [k] timerfn
0.54% [kernel] [k] wake_up_nohz_cpu
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>