Changes in 5.10.213
mmc: mmci: stm32: use a buffer for unaligned DMA requests
mmc: mmci: stm32: fix DMA API overlapping mappings warning
lan78xx: Fix white space and style issues
lan78xx: Add missing return code checks
lan78xx: Fix partial packet errors on suspend/resume
lan78xx: Fix race conditions in suspend/resume handling
net: lan78xx: fix runtime PM count underflow on link stop
ixgbe: {dis, en}able irqs in ixgbe_txrx_ring_{dis, en}able
i40e: disable NAPI right after disabling irqs when handling xsk_pool
tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string
geneve: make sure to pull inner header in geneve_rx()
net: ice: Fix potential NULL pointer dereference in ice_bridge_setlink()
net/ipv6: avoid possible UAF in ip6_route_mpath_notify()
cpumap: Zero-initialise xdp_rxq_info struct before running XDP program
net/rds: fix WARNING in rds_conn_connect_if_down
netfilter: nft_ct: fix l3num expectations with inet pseudo family
netfilter: nf_conntrack_h323: Add protection for bmp length out of range
netrom: Fix a data-race around sysctl_netrom_default_path_quality
netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser
netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser
netrom: Fix a data-race around sysctl_netrom_transport_timeout
netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries
netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay
netrom: Fix a data-race around sysctl_netrom_transport_busy_delay
netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size
netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout
netrom: Fix a data-race around sysctl_netrom_routing_control
netrom: Fix a data-race around sysctl_netrom_link_fails_count
netrom: Fix data-races around sysctl_net_busy_read
selftests/mm: switch to bash from sh
selftests: mm: fix map_hugetlb failure on 64K page size systems
um: allow not setting extra rpaths in the linux binary
xhci: remove extra loop in interrupt context
xhci: prevent double-fetch of transfer and transfer event TRBs
xhci: process isoc TD properly when there was a transaction error mid TD.
xhci: handle isoc Babble and Buffer Overrun events properly
serial: max310x: Use devm_clk_get_optional() to get the input clock
serial: max310x: Try to get crystal clock rate from property
serial: max310x: fail probe if clock crystal is unstable
serial: max310x: Make use of device properties
serial: max310x: use regmap methods for SPI batch operations
serial: max310x: use a separate regmap for each port
serial: max310x: prevent infinite while() loop in port startup
net: Change sock_getsockopt() to take the sk ptr instead of the sock ptr
bpf: net: Change sk_getsockopt() to take the sockptr_t argument
lsm: make security_socket_getpeersec_stream() sockptr_t safe
lsm: fix default return value of the socket_getpeersec_*() hooks
ext4: make ext4_es_insert_extent() return void
ext4: refactor ext4_da_map_blocks()
ext4: convert to exclusive lock while inserting delalloc extents
Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening
hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening
hv_netvsc: Wait for completion on request SWITCH_DATA_PATH
hv_netvsc: Process NETDEV_GOING_DOWN on VF hot remove
hv_netvsc: Make netvsc/VF binding check both MAC and serial number
hv_netvsc: use netif_is_bond_master() instead of open code
hv_netvsc: Register VF in netvsc_probe if NET_DEVICE_REGISTER missed
mm/hugetlb: change hugetlb_reserve_pages() to type bool
mm: hugetlb pages should not be reserved by shmat() if SHM_NORESERVE
getrusage: add the "signal_struct *sig" local variable
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
getrusage: use __for_each_thread()
getrusage: use sig->stats_lock rather than lock_task_sighand()
serial: max310x: Unprepare and disable clock in error path
Drivers: hv: vmbus: Drop error message when 'No request id available'
regmap: allow to define reg_update_bits for no bus configuration
regmap: Add bulk read/write callbacks into regmap_config
serial: max310x: make accessing revision id interface-agnostic
serial: max310x: implement I2C support
serial: max310x: fix IO data corruption in batched operations
Linux 5.10.213
Change-Id: I3450b2b1b545eeb2e3eb862f39d1846a31d17a0a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit f7ec1cd5cc7ef3ad964b677ba82b8b77f1c93009 ]
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
getrusage() at the same time and the process has NR_THREADS, spin_lock_irq
will spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change getrusage() to use sig->stats_lock, it was specifically designed
for this type of use. This way it runs lockless in the likely case.
TODO:
- Change do_task_stat() to use sig->stats_lock too, then we can
remove spin_lock_irq(siglock) in wait_task_zombie().
- Turn sig->stats_lock into seqcount_rwlock_t, this way the
readers in the slow mode won't exclude each other. See
https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/
- stats_lock has to disable irqs because ->siglock can be taken
in irq context, it would be very nice to change __exit_signal()
to avoid the siglock->stats_lock dependency.
Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 13b7bc60b5353371460a203df6c38ccd38ad7a3a ]
do/while_each_thread should be avoided when possible.
Plus this change allows to avoid lock_task_sighand(), we can use rcu
and/or sig->stats_lock instead.
Link: https://lkml.kernel.org/r/20230909172629.GA20454@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: f7ec1cd5cc7e ("getrusage: use sig->stats_lock rather than lock_task_sighand()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit daa694e4137571b4ebec330f9a9b4d54aa8b8089 ]
Patch series "getrusage: use sig->stats_lock", v2.
This patch (of 2):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
This is also preparation for the next patch which changes getrusage() to
use stats_lock instead of siglock, thread_group_cputime() takes the same
lock. With the current implementation recursive read_seqbegin_or_lock()
is fine, thread_group_cputime() can't enter the slow mode if the caller
holds stats_lock, yet this looks more safe and better performance-wise.
Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com
Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.179
ARM: dts: rockchip: fix a typo error for rk3288 spdif node
arm64: dts: qcom: ipq8074-hk01: enable QMP device, not the PHY node
arm64: dts: meson-g12-common: specify full DMC range
arm64: dts: imx8mm-evk: correct pmic clock source
netfilter: br_netfilter: fix recent physdev match breakage
regulator: fan53555: Explicitly include bits header
net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
virtio_net: bugfix overflow inside xdp_linearize_page()
sfc: Split STATE_READY in to STATE_NET_DOWN and STATE_NET_UP.
sfc: Fix use-after-free due to selftest_work
netfilter: nf_tables: fix ifdef to also consider nf_tables=m
i40e: fix accessing vsi->active_filters without holding lock
i40e: fix i40e_setup_misc_vector() error handling
mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
net: rpl: fix rpl header size calculation
mlxsw: pci: Fix possible crash during initialization
bpf: Fix incorrect verifier pruning due to missing register precision taints
e1000e: Disable TSO on i219-LM card to increase speed
f2fs: Fix f2fs_truncate_partial_nodes ftrace event
Input: i8042 - add quirk for Fujitsu Lifebook A574/H
selftests: sigaltstack: fix -Wuninitialized
scsi: megaraid_sas: Fix fw_crash_buffer_show()
scsi: core: Improve scsi_vpd_inquiry() checks
net: dsa: b53: mmap: add phy ops
s390/ptrace: fix PTRACE_GET_LAST_BREAK error handling
nvme-tcp: fix a possible UAF when failing to allocate an io queue
xen/netback: use same error messages for same errors
powerpc/doc: Fix htmldocs errors
xfs: drop submit side trans alloc for append ioends
iio: light: tsl2772: fix reading proximity-diodes from device tree
nilfs2: initialize unused bytes in segment summary blocks
memstick: fix memory leak if card device is never registered
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
mmc: sdhci_am654: Set HIGH_SPEED_ENA for SDR12 and SDR25
mm/khugepaged: check again on anon uffd-wp during isolation
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
sched/uclamp: Fix fits_capacity() check in feec()
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
sched/fair: Detect capacity inversion
sched/fair: Consider capacity inversion in util_fits_cpu()
sched/uclamp: Fix a uninitialized variable warnings
sched/fair: Fixes for capacity inversion detection
MIPS: Define RUNTIME_DISCARD_EXIT in LD script
docs: futex: Fix kernel-doc references after code split-up preparation
purgatory: fix disabling debug info
virtiofs: clean up error handling in virtio_fs_get_tree()
virtiofs: split requests that exceed virtqueue size
fuse: check s_root when destroying sb
fuse: fix attr version comparison in fuse_read_update_size()
fuse: always revalidate rename target dentry
fuse: fix deadlock between atomic O_TRUNC and page invalidation
Revert "ext4: fix use-after-free in ext4_xattr_set_entry"
ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
ext4: fix use-after-free in ext4_xattr_set_entry
udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).
tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().
dccp: Call inet6_destroy_sock() via sk->sk_destruct().
sctp: Call inet6_destroy_sock() via sk->sk_destruct().
pwm: meson: Explicitly set .polarity in .get_state()
pwm: iqs620a: Explicitly set .polarity in .get_state()
pwm: hibvt: Explicitly set .polarity in .get_state()
iio: adc: at91-sama5d2_adc: fix an error code in at91_adc_allocate_trigger()
ASoC: fsl_asrc_dma: fix potential null-ptr-deref
ASN.1: Fix check for strdup() success
Linux 5.10.179
Change-Id: I54e476aa9b199a4711a091c77583739ed82af5ad
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 659c0ce1cb9efc7f58d380ca4bb2a51ae9e30553 upstream.
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.165
btrfs: fix trace event name typo for FLUSH_DELAYED_REFS
pNFS/filelayout: Fix coalescing test for single DS
selftests/bpf: check null propagation only neither reg is PTR_TO_BTF_ID
tools/virtio: initialize spinlocks in vring_test.c
net/ethtool/ioctl: return -EOPNOTSUPP if we have no phy stats
RDMA/srp: Move large values to a new enum for gcc13
btrfs: always report error in run_one_delayed_ref()
x86/asm: Fix an assembler warning with current binutils
f2fs: let's avoid panic if extent_tree is not created
wifi: brcmfmac: fix regression for Broadcom PCIe wifi devices
wifi: mac80211: sdata can be NULL during AMPDU start
Add exception protection processing for vd in axi_chan_handle_err function
zonefs: Detect append writes at invalid locations
nilfs2: fix general protection fault in nilfs_btree_insert()
efi: fix userspace infinite retry read efivars after EFI runtime services page fault
ALSA: hda/realtek - Turn on power early
drm/i915/gt: Reset twice
Bluetooth: hci_qca: Wait for timeout during suspend
Bluetooth: hci_qca: Fix driver shutdown on closed serdev
io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
io_uring: improve send/recv error handling
io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly
io_uring: add flag for disabling provided buffer recycling
io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
io_uring: allow re-poll if we made progress
io_uring: fix async accept on O_NONBLOCK sockets
io_uring: check for valid register opcode earlier
io_uring: lock overflowing for IOPOLL
io_uring: fix CQ waiting timeout handling
io_uring: ensure that cached task references are always put on exit
io_uring: remove duplicated calls to io_kiocb_ppos
io_uring: update kiocb->ki_pos at execution time
io_uring: do not recalculate ppos unnecessarily
io_uring/rw: defer fsnotify calls to task context
xhci-pci: set the dma max_seg_size
usb: xhci: Check endpoint is valid before dereferencing it
xhci: Fix null pointer dereference when host dies
xhci: Add update_hub_device override for PCI xHCI hosts
xhci: Add a flag to disable USB3 lpm on a xhci root port level.
usb: acpi: add helper to check port lpm capability using acpi _DSM
xhci: Detect lpm incapable xHC USB3 roothub ports from ACPI tables
prlimit: do_prlimit needs to have a speculation check
USB: serial: option: add Quectel EM05-G (GR) modem
USB: serial: option: add Quectel EM05-G (CS) modem
USB: serial: option: add Quectel EM05-G (RS) modem
USB: serial: option: add Quectel EC200U modem
USB: serial: option: add Quectel EM05CN (SG) modem
USB: serial: option: add Quectel EM05CN modem
staging: vchiq_arm: fix enum vchiq_status return types
USB: misc: iowarrior: fix up header size for USB_DEVICE_ID_CODEMERCS_IOW100
misc: fastrpc: Don't remove map on creater_process and device_release
misc: fastrpc: Fix use-after-free race condition for maps
usb: core: hub: disable autosuspend for TI TUSB8041
comedi: adv_pci1760: Fix PWM instruction handling
mmc: sunxi-mmc: Fix clock refcount imbalance during unbind
mmc: sdhci-esdhc-imx: correct the tuning start tap and step setting
btrfs: fix race between quota rescan and disable leading to NULL pointer deref
cifs: do not include page data when checking signature
thunderbolt: Use correct function to calculate maximum USB3 link rate
tty: serial: qcom-geni-serial: fix slab-out-of-bounds on RX FIFO buffer
USB: gadgetfs: Fix race between mounting and unmounting
USB: serial: cp210x: add SCALANCE LPE-9000 device id
usb: host: ehci-fsl: Fix module alias
usb: typec: altmodes/displayport: Add pin assignment helper
usb: typec: altmodes/displayport: Fix pin assignment calculation
usb: gadget: g_webcam: Send color matching descriptor per frame
usb: gadget: f_ncm: fix potential NULL ptr deref in ncm_bitrate()
usb-storage: apply IGNORE_UAS only for HIKSEMI MD202 on RTL9210
dt-bindings: phy: g12a-usb2-phy: fix compatible string documentation
dt-bindings: phy: g12a-usb3-pcie-phy: fix compatible string documentation
serial: pch_uart: Pass correct sg to dma_unmap_sg()
dmaengine: tegra210-adma: fix global intr clear
serial: atmel: fix incorrect baudrate setup
gsmi: fix null-deref in gsmi_get_variable
mei: me: add meteor lake point M DID
drm/i915: re-disable RC6p on Sandy Bridge
drm/amd/display: Fix set scaling doesn's work
drm/amd/display: Calculate output_color_space after pixel encoding adjustment
drm/amd/display: Fix COLOR_SPACE_YCBCR2020_TYPE matrix
arm64: efi: Execute runtime services from a dedicated stack
efi: rt-wrapper: Add missing include
Revert "drm/amdgpu: make display pinning more flexible (v2)"
x86/fpu: Use _Alignof to avoid undefined behavior in TYPE_ALIGN
tracing: Use alignof__(struct {type b;}) instead of offsetof()
io_uring: io_kiocb_update_pos() should not touch file for non -1 offset
io_uring/net: fix fast_iov assignment in io_setup_async_msg()
net/ulp: use consistent error code when blocking ULP
net/mlx5: fix missing mutex_unlock in mlx5_fw_fatal_reporter_err_work()
Revert "wifi: mac80211: fix memory leak in ieee80211_if_add()"
soc: qcom: apr: Make qcom,protection-domain optional again
Bluetooth: hci_qca: Wait for SSR completion during suspend
Bluetooth: hci_qca: check for SSR triggered flag while suspend
Bluetooth: hci_qca: Fixed issue during suspend
mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma
io_uring: Clean up a false-positive warning from GCC 9.3.0
io_uring: fix double poll leak on repolling
io_uring/rw: ensure kiocb_end_write() is always called
io_uring/rw: remove leftover debug statement
Linux 5.10.165
Change-Id: Icb91157d9fa1b56cd79eedb8a9cc6118d0705244
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 739790605705ddcf18f21782b9c99ad7d53a8c11 upstream.
do_prlimit() adds the user-controlled resource value to a pointer that
will subsequently be dereferenced. In order to help prevent this
codepath from being used as a spectre "gadget" a barrier needs to be
added after checking the range.
Reported-by: Jordy Zomer <jordyzomer@google.com>
Tested-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.69
PCI: pci-bridge-emul: Add PCIe Root Capabilities Register
PCI: aardvark: Fix reporting CRS value
console: consume APC, DM, DCS
s390/pci_mmio: fully validate the VMA before calling follow_pte()
ARM: Qualify enabling of swiotlb_init()
ARM: 9077/1: PLT: Move struct plt_entries definition to header
ARM: 9078/1: Add warn suppress parameter to arm_gen_branch_link()
ARM: 9079/1: ftrace: Add MODULE_PLTS support
ARM: 9098/1: ftrace: MODULE_PLT: Fix build problem without DYNAMIC_FTRACE
Revert "net/mlx5: Register to devlink ingress VLAN filter trap"
sctp: validate chunk size in __rcv_asconf_lookup
sctp: add param size validation for SCTP_PARAM_SET_PRIMARY
staging: rtl8192u: Fix bitwise vs logical operator in TranslateRxSignalStuff819xUsb()
coredump: fix memleak in dump_vma_snapshot()
um: virtio_uml: fix memory leak on init failures
dmaengine: acpi: Avoid comparison GSI with Linux vIRQ
perf test: Fix bpf test sample mismatch reporting
tools lib: Adopt memchr_inv() from kernel
perf tools: Allow build-id with trailing zeros
thermal/drivers/exynos: Fix an error code in exynos_tmu_probe()
9p/trans_virtio: Remove sysfs file on probe failure
prctl: allow to setup brk for et_dyn executables
nilfs2: use refcount_dec_and_lock() to fix potential UAF
profiling: fix shift-out-of-bounds bugs
PM: sleep: core: Avoid setting power.must_resume to false
pwm: lpc32xx: Don't modify HW state in .probe() after the PWM chip was registered
pwm: mxs: Don't modify HW state in .probe() after the PWM chip was registered
dmaengine: idxd: fix wq slot allocation index check
platform/chrome: sensorhub: Add trace events for sample
platform/chrome: cros_ec_trace: Fix format warnings
ceph: allow ceph_put_mds_session to take NULL or ERR_PTR
ceph: cancel delayed work instead of flushing on mdsc teardown
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh
thermal/core: Fix thermal_cooling_device_register() prototype
drm/amdgpu: Disable PCIE_DPM on Intel RKL Platform
drivers: base: cacheinfo: Get rid of DEFINE_SMP_CALL_CACHE_FUNCTION()
dma-buf: DMABUF_MOVE_NOTIFY should depend on DMA_SHARED_BUFFER
parisc: Move pci_dev_is_behind_card_dino to where it is used
iommu/amd: Relocate GAMSup check to early_enable_iommus
dmaengine: idxd: depends on !UML
dmaengine: sprd: Add missing MODULE_DEVICE_TABLE
dmaengine: ioat: depends on !UML
dmaengine: xilinx_dma: Set DMA mask for coherent APIs
ceph: request Fw caps before updating the mtime in ceph_write_iter
ceph: remove the capsnaps when removing caps
ceph: lockdep annotations for try_nonblocking_invalidate
btrfs: update the bdev time directly when closing
btrfs: fix lockdep warning while mounting sprout fs
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
habanalabs: add validity check for event ID received from F/W
pwm: img: Don't modify HW state in .remove() callback
pwm: rockchip: Don't modify HW state in .remove() callback
pwm: stm32-lp: Don't modify HW state in .remove() callback
blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
rtc: rx8010: select REGMAP_I2C
sched/idle: Make the idle timer expire in hard interrupt context
drm/nouveau/nvkm: Replace -ENOSYS with -ENODEV
Linux 5.10.69
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I982349e3a65b83e92e9b808154bf8c84d094f1d6
commit e1fbbd073137a9d63279f6bf363151a938347640 upstream.
Keno Fischer reported that when a binray loaded via ld-linux-x the
prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays
before mm:end_data.
For example a test program shows
| # ~/t
|
| start_code 401000
| end_code 401a15
| start_stack 7ffce4577dd0
| start_data 403e10
| end_data 40408c
| start_brk b5b000
| sbrk(0) b5b000
and when executed via ld-linux
| # /lib64/ld-linux-x86-64.so.2 ~/t
|
| start_code 7fc25b0a4000
| end_code 7fc25b0c4524
| start_stack 7fffcc6b2400
| start_data 7fc25b0ce4c0
| end_data 7fc25b0cff98
| start_brk 55555710c000
| sbrk(0) 55555710c000
This of course prevent criu from restoring such programs. Looking into
how kernel operates with brk/start_brk inside brk() syscall I don't see
any problem if we allow to setup brk/start_brk without checking for
end_data. Even if someone pass some weird address here on a purpose then
the worst possible result will be an unexpected unmapping of existing vma
(own vma, since prctl works with the callers memory) but test for
RLIMIT_DATA is still valid and a user won't be able to gain more memory in
case of expanding VMAs via new values shipped with prctl call.
Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Andrey Vagin <avagin@gmail.com>
Tested-by: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.
The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Bug: 192536783
(cherry picked from commit 201698626fbca1cf1a3b686ba14cf2a056500716)
Change-Id: Ic0a21c92a22575f9ec3599fb67bd2931a50b9f04
[quic_eberman@quicinc.com: Resolved merge conflict in
arch/arm64/kernel/process.c]
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Collingbourne <pcc@google.com>
[ Upstream commit 905ae01c4ae2ae3df05bb141801b1db4b7d83c61 ]
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.
Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().
Changelog
v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
error was caused by the fact that cred_alloc_blank() left the ucounts
pointer empty.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/userfaultfd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in
fact a "pointer to (pointer to int in userspace) in userspace". So
sparse rightfully complains about passing a kernel pointer to
put_user().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull SafeSetID updates from Micah Morton:
"The changes are mostly contained to within the SafeSetID LSM, with the
exception of a few 1-line changes to change some ns_capable() calls to
ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that
is examined by SafeSetID code and nothing else in the kernel.
The changes to SafeSetID internally allow for setting up GID
transition security policies, as already existed for UIDs"
* tag 'safesetid-5.10' of git://github.com/micah-morton/linux:
LSM: SafeSetID: Fix warnings reported by test bot
LSM: SafeSetID: Add GID security policy handling
LSM: Signal to SafeSetID when setting group IDs
For SafeSetID to properly gate set*gid() calls, it needs to know whether
ns_capable() is being called from within a sys_set*gid() function or is
being called from elsewhere in the kernel. This allows SafeSetID to deny
CAP_SETGID to restricted groups when they are attempting to use the
capability for code paths other than updating GIDs (e.g. setting up
userns GID mappings). This is the identical approach to what is
currently done for CAP_SETUID.
NOTE: We also add signaling to SafeSetID from the setgroups() syscall,
as we have future plans to restrict a process' ability to set
supplementary groups in addition to what is added in this series for
restricting setting of the primary group.
Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
Steps on the way to 5.9-rc1
Resolves conflicts in:
drivers/irqchip/qcom-pdc.c
include/linux/device.h
net/xfrm/xfrm_state.c
security/lsm_audit.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4aeb3d04f4717714a421721eb3ce690c099bb30a
Originally, only a local CAP_SYS_ADMIN could change the exe link,
making it difficult for doing checkpoint/restore without CAP_SYS_ADMIN.
This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN
for permitting changing the exe link.
The following describes the history of the /proc/self/exe permission
checks as it may be difficult to understand what decisions lead to this
point.
* [1] May 2012: This commit introduces the ability of changing
/proc/self/exe if the user is CAP_SYS_RESOURCE capable.
In the related discussion [2], no clear thread model is presented for
what could happen if the /proc/self/exe changes multiple times, or why
would the admin be at the mercy of userspace.
* [3] Oct 2014: This commit introduces a new API to change
/proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE,
but instead checks if the current user is root (uid=0) in its local
namespace. In the related discussion [4] it is said that "Controlling
exe_fd without privileges may turn out to be dangerous. At least
things like tomoyo examine it for making policy decisions (see
tomoyo_manager())."
* [5] Dec 2016: This commit removes the restriction to change
/proc/self/exe at most once. The related discussion [6] informs that
the audit subsystem relies on the exe symlink, presumably
audit_log_d_path_exe() in kernel/audit.c.
* [7] May 2017: This commit changed the check from uid==0 to local
CAP_SYS_ADMIN. No discussion.
* [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace
is demonstrated
Overall, the concrete points that were made to retain capability checks
around changing the exe symlink is that tomoyo_manager() and
audit_log_d_path_exe() uses the exe_file path.
Christian Brauner said that relying on /proc/<pid>/exe being immutable (or
guarded by caps) in a sake of security is a bit misleading. It can only
be used as a hint without any guarantees of what code is being executed
once execve() returns to userspace. Christian suggested that in the
future, we could call audit_log() or similar to inform the admin of all
exe link changes, instead of attempting to provide security guarantees
via permission checks. However, this proposed change requires the
understanding of the security implications in the tomoyo/audit subsystems.
[1] b32dfe3771 ("c/r: prctl: add ability to set new mm_struct::exe_file")
[2] https://lore.kernel.org/patchwork/patch/292515/
[3] f606b77f1a ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[4] https://lore.kernel.org/patchwork/patch/479359/
[5] 3fb4afd9a5 ("prctl: remove one-shot limitation for changing exe link")
[6] https://lore.kernel.org/patchwork/patch/697304/
[7] 4d28df6152 ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[8] https://github.com/nviennot/run_as_exe
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-6-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
It's now being abstracted away, so fix up the ANDROID specific code that
touches the lock to use the correct functions instead.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9cedb62118cd1bba7af27deb11499d312d24d7fc
Pull SafeSetID update from Micah Morton:
"Add additional LSM hooks for SafeSetID
SafeSetID is capable of making allow/deny decisions for set*uid calls
on a system, and we want to add similar functionality for set*gid
calls.
The work to do that is not yet complete, so probably won't make it in
for v5.8, but we are looking to get this simple patch in for v5.8
since we have it ready.
We are planning on the rest of the work for extending the SafeSetID
LSM being merged during the v5.9 merge window"
* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
security: Add LSM hooks to set*gid syscalls
The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analagous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.
Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
Tiny merge resolutions along the way to 5.8-rc1.
Change-Id: I24b3cca28ed36f32c92b6374dae5d7f006d3bced
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.
Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).
The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled. The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.
This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.
The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.
This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi. This ensures it will always be
able to have some pages in flight, and so will not deadlock.
In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.
This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.
So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless the
global and the local free-run thresholds are exceeded.
Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks. I don't know what is wanted for realtime.
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baby steps in the 5.6-rc1 merge cycle to make things easier to review
and debug.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0fa183764fd1adbde44e8181f0b3df6cff4da18b
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
amd nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace deamons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.
In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want execute IO to the device.
Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory. Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.
[ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195] ? __schedule+0x29b/0x630
[ 1626.609197] ? wait_for_completion+0xe0/0x170
[ 1626.609198] schedule+0x30/0xb0
[ 1626.609200] schedule_timeout+0x1f6/0x2f0
[ 1626.609202] ? blk_finish_plug+0x21/0x2e
[ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206] ? wait_for_completion+0xe0/0x170
[ 1626.609208] wait_for_completion+0x108/0x170
[ 1626.609210] ? wake_up_q+0x70/0x70
[ 1626.609212] ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214] ? xfs_bwrite+0x25/0x60
[ 1626.609215] xfs_buf_iowait+0x22/0xf0
[ 1626.609218] __xfs_buf_submit+0x12e/0x250
[ 1626.609220] xfs_bwrite+0x25/0x60
[ 1626.609222] xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228] super_cache_scan+0x152/0x1a0
[ 1626.609231] do_shrink_slab+0x12c/0x2d0
[ 1626.609233] shrink_slab+0x9c/0x2a0
[ 1626.609235] shrink_node+0xd7/0x470
[ 1626.609237] do_try_to_free_pages+0xbf/0x380
[ 1626.609240] try_to_free_pages+0xd9/0x1f0
[ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251] ? ___slab_alloc+0x238/0x560
[ 1626.609254] __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259] skb_page_frag_refill+0x97/0xd0
[ 1626.609274] sk_page_frag_refill+0x1d/0x80
[ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304] tcp_sendmsg+0x27/0x40
[ 1626.609307] sock_sendmsg+0x54/0x60
[ 1626.609308] ___sys_sendmsg+0x29f/0x320
[ 1626.609313] ? sock_poll+0x66/0xb0
[ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320] ? ep_send_events_proc+0xe6/0x230
[ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324] ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329] ? __hrtimer_init+0xb0/0xb0
[ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334] ? ep_poll+0x26c/0x4a0
[ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339] ? release_sock+0x43/0x90
[ 1626.609341] ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342] __sys_sendmsg+0x47/0x80
[ 1626.609347] do_syscall_64+0x5f/0x1c0
[ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to prevent from being throttled while
writing out data to free up memory.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
There are two 'struct timeval' fields in 'struct rusage'.
Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.
While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.
In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This merges Linus's tree as of commit b41dae061b ("Merge tag
'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")
into android-mainline.
This "early" merge makes it easier to test and handle merge conflicts
instead of having to wait until the "end" of the merge window and handle
all 10000+ commits at once.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
Pull x86 cpu-feature updates from Ingo Molnar:
- Rework the Intel model names symbols/macros, which were decades of
ad-hoc extensions and added random noise. It's now a coherent, easy
to follow nomenclature.
- Add new Intel CPU model IDs:
- "Tiger Lake" desktop and mobile models
- "Elkhart Lake" model ID
- and the "Lightning Mountain" variant of Airmont, plus support code
- Add the new AVX512_VP2INTERSECT instruction to cpufeatures
- Remove Intel MPX user-visible APIs and the self-tests, because the
toolchain (gcc) is not supporting it going forward. This is the
first, lowest-risk phase of MPX removal.
- Remove X86_FEATURE_MFENCE_RDTSC
- Various smaller cleanups and fixes
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/cpu: Update init data for new Airmont CPU model
x86/cpu: Add new Airmont variant to Intel family
x86/cpu: Add Elkhart Lake to Intel family
x86/cpu: Add Tiger Lake to Intel family
x86: Correct misc typos
x86/intel: Add common OPTDIFFs
x86/intel: Aggregate microserver naming
x86/intel: Aggregate big core graphics naming
x86/intel: Aggregate big core mobile naming
x86/intel: Aggregate big core client naming
x86/cpufeature: Explain the macro duplication
x86/ftrace: Remove mcount() declaration
x86/PCI: Remove superfluous returns from void functions
x86/msr-index: Move AMD MSRs where they belong
x86/cpu: Use constant definitions for CPU models
lib: Remove redundant ftrace flag removal
x86/crash: Remove unnecessary comparison
x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
x86: Remove X86_FEATURE_MFENCE_RDTSC
x86/mpx: Remove MPX APIs
...
Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:
if (cache == 0 || new < cache)
cache = new;
Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:
if (new < cache)
cache = new;
This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de