This reverts commit ed5d4efb4d which is
commit e4a38402c36e42df28eb1a5394be87e6571fb48a upstream.
It breaks the kernel ABI and should not be an issue for Android at this
point in time. If it is, it can come back in a different, abi-stable
form.
Bug: 161946584
Fixes: ed5d4efb4d ("oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup")
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I729614c373a7b28546376913e719bbaf6dd306a9
Changes in 5.10.113
etherdevice: Adjust ether_addr* prototypes to silence -Wstringop-overead
mm: page_alloc: fix building error on -Werror=array-compare
tracing: Dump stacktrace trigger to the corresponding instance
perf tools: Fix segfault accessing sample_id xyarray
gfs2: assign rgrp glock before compute_bitstructs
net/sched: cls_u32: fix netns refcount changes in u32_change()
ALSA: usb-audio: Clear MIDI port active flag after draining
ALSA: hda/realtek: Add quirk for Clevo NP70PNP
dm: fix mempool NULL pointer race when completing IO
ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek
ASoC: msm8916-wcd-digital: Check failure for devm_snd_soc_register_component
ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
dmaengine: imx-sdma: Fix error checking in sdma_event_remap
dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
spi: spi-mtk-nor: initialize spi controller after resume
esp: limit skb_page_frag_refill use to a single page
igc: Fix infinite loop in release_swfw_sync
igc: Fix BUG: scheduling while atomic
rxrpc: Restore removed timer deletion
net/smc: Fix sock leak when release after smc_shutdown()
net/packet: fix packet_sock xmit return value checking
ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()
ip6_gre: Fix skb_under_panic in __gre6_xmit()
net/sched: cls_u32: fix possible leak in u32_init_knode()
l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
ipv6: make ip6_rt_gc_expire an atomic_t
netlink: reset network and mac headers in netlink_dump()
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
dmaengine: idxd: add RO check for wq max_batch_size write
dmaengine: idxd: add RO check for wq max_transfer_size write
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
arm64/mm: Remove [PUD|PMD]_TABLE_BIT from [pud|pmd]_bad()
arm64: mm: fix p?d_leaf()
ARM: vexpress/spc: Avoid negative array index when !SMP
reset: tegra-bpmp: Restore Handle errors in BPMP response
platform/x86: samsung-laptop: Fix an unsigned comparison which can never be negative
ALSA: usb-audio: Fix undefined behavior due to shift overflowing the constant
arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
vxlan: fix error return code in vxlan_fdb_append
cifs: Check the IOCB_DIRECT flag, not O_DIRECT
net: atlantic: Avoid out-of-bounds indexing
mt76: Fix undefined behavior due to shift overflowing the constant
brcmfmac: sdio: Fix undefined behavior due to shift overflowing the constant
dpaa_eth: Fix missing of_node_put in dpaa_get_ts_info()
drm/msm/mdp5: check the return of kzalloc()
net: macb: Restart tx only if queue pointer is lagging
scsi: qedi: Fix failed disconnect handling
stat: fix inconsistency between struct stat and struct compat_stat
nvme: add a quirk to disable namespace identifiers
nvme-pci: disable namespace identifiers for Qemu controllers
EDAC/synopsys: Read the error count from the correct register
mm, hugetlb: allow for "high" userspace addresses
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
ata: pata_marvell: Check the 'bmdma_addr' beforing reading
dma: at_xdmac: fix a missing check on list iterator
net: atlantic: invert deep par in pm functions, preventing null derefs
xtensa: patch_text: Fixup last cpu should be master
xtensa: fix a7 clobbering in coprocessor context load/store
openvswitch: fix OOB access in reserve_sfa_size()
gpio: Request interrupts after IRQ is initialized
ASoC: soc-dapm: fix two incorrect uses of list iterator
e1000e: Fix possible overflow in LTR decoding
ARC: entry: fix syscall_trace_exit argument
arm_pmu: Validate single/group leader events
sched/pelt: Fix attach_entity_load_avg() corner case
perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
KVM: PPC: Fix TCE handling for VFIO
drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
powerpc/perf: Fix power9 event alternatives
perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
ext4: fix fallocate to use file_modified to update permissions consistently
ext4: fix symlink file size not match to file content
ext4: fix use-after-free in ext4_search_dir
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4, doc: fix incorrect h_reserved size
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4: force overhead calculation if the s_overhead_cluster makes no sense
can: isotp: stop timeout monitoring when no first frame was sent
jbd2: fix a potential race while discarding reserved buffers after an abort
spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
staging: ion: Prevent incorrect reference counting behavour
block/compat_ioctl: fix range check in BLKGETSIZE
Revert "net: micrel: fix KS8851_MLL Kconfig"
Linux 5.10.113
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4ed10699cbb32b89caf79b8b4a2a35b3d8824115
Changes in 5.10.102
drm/nouveau/pmu/gm200-: use alternate falcon reset sequence
mm: memcg: synchronize objcg lists with a dedicated spinlock
rcu: Do not report strict GPs for outgoing CPUs
fget: clarify and improve __fget_files() implementation
fs/proc: task_mmu.c: don't read mapcount for migration entry
can: isotp: prevent race between isotp_bind() and isotp_setsockopt()
can: isotp: add SF_BROADCAST support for functional addressing
scsi: lpfc: Fix mailbox command failure during driver initialization
HID:Add support for UGTABLET WP5540
Revert "svm: Add warning message for AVIC IPI invalid target"
serial: parisc: GSC: fix build when IOSAPIC is not set
parisc: Drop __init from map_pages declaration
parisc: Fix data TLB miss in sba_unmap_sg
parisc: Fix sglist access in ccio-dma.c
mmc: block: fix read single on recovery logic
mm: don't try to NUMA-migrate COW pages that have other uses
PCI: hv: Fix NUMA node assignment when kernel boots with custom NUMA topology
parisc: Add ioread64_lo_hi() and iowrite64_lo_hi()
btrfs: send: in case of IO error log it
platform/x86: touchscreen_dmi: Add info for the RWC NANOTE P8 AY07J 2-in-1
platform/x86: ISST: Fix possible circular locking dependency detected
selftests: rtc: Increase test timeout so that all tests run
kselftest: signal all child processes
net: ieee802154: at86rf230: Stop leaking skb's
selftests/zram: Skip max_comp_streams interface on newer kernel
selftests/zram01.sh: Fix compression ratio calculation
selftests/zram: Adapt the situation that /dev/zram0 is being used
selftests: openat2: Print also errno in failure messages
selftests: openat2: Add missing dependency in Makefile
selftests: openat2: Skip testcases that fail with EOPNOTSUPP
selftests: skip mincore.check_file_mmap when fs lacks needed support
ax25: improve the incomplete fix to avoid UAF and NPD bugs
vfs: make freeze_super abort when sync_filesystem returns error
quota: make dquot_quota_sync return errors from ->sync_fs
scsi: pm8001: Fix use-after-free for aborted TMF sas_task
scsi: pm8001: Fix use-after-free for aborted SSP/STP sas_task
nvme: fix a possible use-after-free in controller reset during load
nvme-tcp: fix possible use-after-free in transport error_recovery work
nvme-rdma: fix possible use-after-free in transport error_recovery work
drm/amdgpu: fix logic inversion in check
x86/Xen: streamline (and fix) PV CPU enumeration
Revert "module, async: async_synchronize_full() on module init iff async is used"
gcc-plugins/stackleak: Use noinstr in favor of notrace
random: wake up /dev/random writers after zap
kbuild: lto: merge module sections
kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled
iwlwifi: fix use-after-free
drm/radeon: Fix backlight control on iMac 12,1
drm/i915/opregion: check port number bounds for SWSCI display power state
vsock: remove vsock from connected table when connect is interrupted by a signal
drm/i915/gvt: Make DRM_I915_GVT depend on X86
iwlwifi: pcie: fix locking when "HW not ready"
iwlwifi: pcie: gen2: fix locking when "HW not ready"
selftests: netfilter: fix exit value for nft_concat_range
netfilter: nft_synproxy: unregister hooks on init error path
ipv6: per-netns exclusive flowlabel checks
net: dsa: lan9303: fix reset on probe
net: dsa: lantiq_gswip: fix use after free in gswip_remove()
net: ieee802154: ca8210: Fix lifs/sifs periods
ping: fix the dif and sdif check in ping_lookup
bonding: force carrier update when releasing slave
drop_monitor: fix data-race in dropmon_net_event / trace_napi_poll_hit
net_sched: add __rcu annotation to netdev->qdisc
bonding: fix data-races around agg_select_timer
libsubcmd: Fix use-after-free for realloc(..., 0)
dpaa2-eth: Initialize mutex used in one step timestamping path
perf bpf: Defer freeing string after possible strlen() on it
selftests/exec: Add non-regular to TEST_GEN_PROGS
ALSA: hda/realtek: Add quirk for Legion Y9000X 2019
ALSA: hda/realtek: Fix deadlock by COEF mutex
ALSA: hda: Fix regression on forced probe mask option
ALSA: hda: Fix missing codec probe on Shenker Dock 15
ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw()
ASoC: ops: Fix stereo change notifications in snd_soc_put_volsw_range()
powerpc/lib/sstep: fix 'ptesync' build error
mtd: rawnand: gpmi: don't leak PM reference in error path
KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests
ASoC: tas2770: Insert post reset delay
block/wbt: fix negative inflight counter when remove scsi device
NFS: LOOKUP_DIRECTORY is also ok with symlinks
NFS: Do not report writeback errors in nfs_getattr()
tty: n_tty: do not look ahead for EOL character past the end of the buffer
mtd: rawnand: qcom: Fix clock sequencing in qcom_nandc_probe()
mtd: rawnand: brcmnand: Fixed incorrect sub-page ECC status
Drivers: hv: vmbus: Fix memory leak in vmbus_add_channel_kobj
KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id()
KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
ARM: OMAP2+: hwmod: Add of_node_put() before break
ARM: OMAP2+: adjust the location of put_device() call in omapdss_init_of
phy: usb: Leave some clocks running during suspend
irqchip/sifive-plic: Add missing thead,c900-plic match string
netfilter: conntrack: don't refresh sctp entries in closed state
arm64: dts: meson-gx: add ATF BL32 reserved-memory region
arm64: dts: meson-g12: add ATF BL32 reserved-memory region
arm64: dts: meson-g12: drop BL32 region from SEI510/SEI610
pidfd: fix test failure due to stack overflow on some arches
selftests: fixup build warnings in pidfd / clone3 tests
kconfig: let 'shell' return enough output for deep path names
lib/iov_iter: initialize "flags" in new pipe_buffer
ata: libata-core: Disable TRIM on M88V29
soc: aspeed: lpc-ctrl: Block error printing on probe defer cases
xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
drm/rockchip: dw_hdmi: Do not leave clock enabled in error case
tracing: Fix tp_printk option related with tp_printk_stop_on_boot
net: usb: qmi_wwan: Add support for Dell DW5829e
net: macb: Align the dma and coherent dma masks
kconfig: fix failing to generate auto.conf
scsi: lpfc: Fix pt2pt NVMe PRLI reject LOGO loop
EDAC: Fix calculation of returned address and next offset in edac_align_ptr()
net: sched: limit TC_ACT_REPEAT loops
dmaengine: sh: rcar-dmac: Check for error num after setting mask
dmaengine: stm32-dmamux: Fix PM disable depth imbalance in stm32_dmamux_probe
dmaengine: sh: rcar-dmac: Check for error num after dma_set_max_seg_size
i2c: qcom-cci: don't delete an unregistered adapter
i2c: qcom-cci: don't put a device tree node before i2c_add_adapter()
copy_process(): Move fd_install() out of sighand->siglock critical section
i2c: brcmstb: fix support for DSL and CM variants
lockdep: Correct lock_classes index mapping
Linux 5.10.102
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ief12c0d77b23f9796b81a8fc3b79ac6589e81dc9
[ Upstream commit 67d6212afda218d564890d1674bab28e8612170f ]
This reverts commit 774a1221e8.
We need to finish all async code before the module init sequence is
done. In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule(). Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked. This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().
For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);
We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread. As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.
The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)
modprobe pm80xx worker
...
do_init_module()
...
pci_call_probe()
work_on_cpu(local_pci_probe)
local_pci_probe()
pm8001_pci_probe()
scsi_scan_host()
async_schedule()
worker->flags |= PF_USED_ASYNC;
...
< return from worker >
...
if (current->flags & PF_USED_ASYNC) <--- false
async_synchronize_full();
Commit 21c3c5d280 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e8
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.
Since commit 0fdff3ec6d ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.
Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.74
ext4: check and update i_disksize properly
ext4: correct the error path of ext4_write_inline_data_end()
ASoC: Intel: sof_sdw: tag SoundWire BEs as non-atomic
HID: apple: Fix logical maximum and usage maximum of Magic Keyboard JIS
netfilter: ip6_tables: zero-initialize fragment offset
HID: wacom: Add new Intuos BT (CTL-4100WL/CTL-6100WL) device IDs
ASoC: SOF: loader: release_firmware() on load failure to avoid batching
netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic
netfilter: nf_nat_masquerade: defer conntrack walk to work queue
mac80211: Drop frames from invalid MAC address in ad-hoc mode
m68k: Handle arrivals of multiple signals correctly
hwmon: (ltc2947) Properly handle errors when looking for the external clock
net: prevent user from passing illegal stab size
mac80211: check return value of rhashtable_init
vboxfs: fix broken legacy mount signature checking
net: sun: SUNVNET_COMMON should depend on INET
drm/amdgpu: fix gart.bo pin_count leak
scsi: ses: Fix unsigned comparison with less than zero
scsi: virtio_scsi: Fix spelling mistake "Unsupport" -> "Unsupported"
perf/core: fix userpage->time_enabled of inactive events
sched: Always inline is_percpu_thread()
hwmon: (pmbus/ibm-cffps) max_power_out swap changes
Linux 5.10.74
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic5235aa8641c64ee2849355a7f10613ae204c74e
Changes in 5.10.68
drm/bridge: lt9611: Fix handling of 4k panels
btrfs: fix upper limit for max_inline for page size 64K
io_uring: ensure symmetry in handling iter types in loop_rw_iter()
xen: reset legacy rtc flag for PV domU
bnx2x: Fix enabling network interfaces without VFs
arm64/sve: Use correct size when reinitialising SVE state
PM: base: power: don't try to use non-existing RTC for storing data
PCI: Add AMD GPU multi-function power dependencies
drm/amd/amdgpu: Increase HWIP_MAX_INSTANCE to 10
drm/etnaviv: return context from etnaviv_iommu_context_get
drm/etnaviv: put submit prev MMU context when it exists
drm/etnaviv: stop abusing mmu_context as FE running marker
drm/etnaviv: keep MMU context across runtime suspend/resume
drm/etnaviv: exec and MMU state is lost when resetting the GPU
drm/etnaviv: fix MMU context leak on GPU reset
drm/etnaviv: reference MMU context when setting up hardware state
drm/etnaviv: add missing MMU context put when reaping MMU mapping
s390/sclp: fix Secure-IPL facility detection
x86/pat: Pass valid address to sanitize_phys()
x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
tipc: fix an use-after-free issue in tipc_recvmsg
ethtool: Fix rxnfc copy to user buffer overflow
net/{mlx5|nfp|bnxt}: Remove unnecessary RTNL lock assert
net-caif: avoid user-triggerable WARN_ON(1)
ptp: dp83640: don't define PAGE0
dccp: don't duplicate ccid when cloning dccp sock
net/l2tp: Fix reference count leak in l2tp_udp_recv_core
r6040: Restore MDIO clock frequency after MAC reset
tipc: increase timeout in tipc_sk_enqueue()
drm/rockchip: cdn-dp-core: Make cdn_dp_core_resume __maybe_unused
perf machine: Initialize srcline string member in add_location struct
net/mlx5: FWTrace, cancel work on alloc pd error flow
net/mlx5: Fix potential sleeping in atomic context
nvme-tcp: fix io_work priority inversion
events: Reuse value read using READ_ONCE instead of re-reading it
net: ipa: initialize all filter table slots
gen_compile_commands: fix missing 'sys' package
vhost_net: fix OoB on sendmsg() failure.
net/af_unix: fix a data-race in unix_dgram_poll
net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup
x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y
tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
selftest: net: fix typo in altname test
qed: Handle management FW error
udp_tunnel: Fix udp_tunnel_nic work-queue type
dt-bindings: arm: Fix Toradex compatible typo
ibmvnic: check failover_pending in login response
KVM: PPC: Book3S HV: Tolerate treclaim. in fake-suspend mode changing registers
bnxt_en: make bnxt_free_skbs() safe to call after bnxt_free_mem()
net: hns3: pad the short tunnel frame before sending to hardware
net: hns3: change affinity_mask to numa node range
net: hns3: disable mac in flr process
net: hns3: fix the timing issue of VF clearing interrupt sources
mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation
mfd: db8500-prcmu: Adjust map to reality
PCI: Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms
fuse: fix use after free in fuse_read_interrupt()
PCI: tegra194: Fix handling BME_CHGED event
PCI: tegra194: Fix MSI-X programming
PCI: tegra: Fix OF node reference leak
mfd: Don't use irq_create_mapping() to resolve a mapping
PCI: rcar: Fix runtime PM imbalance in rcar_pcie_ep_probe()
tracing/probes: Reject events which have the same name of existing one
PCI: cadence: Use bitfield for *quirk_retrain_flag* instead of bool
PCI: cadence: Add quirk flag to set minimum delay in LTSSM Detect.Quiet state
PCI: j721e: Add PCIe support for J7200
PCI: j721e: Add PCIe support for AM64
PCI: Add ACS quirks for Cavium multi-function devices
watchdog: Start watchdog in watchdog_set_last_hw_keepalive only if appropriate
octeontx2-af: Add additional register check to rvu_poll_reg()
Set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
net: usb: cdc_mbim: avoid altsetting toggling for Telit LN920
block, bfq: honor already-setup queue merges
PCI: ibmphp: Fix double unmap of io_mem
ethtool: Fix an error code in cxgb2.c
NTB: Fix an error code in ntb_msit_probe()
NTB: perf: Fix an error code in perf_setup_inbuf()
s390/bpf: Fix optimizing out zero-extensions
s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant
s390/bpf: Fix branch shortening during codegen pass
mfd: axp20x: Update AXP288 volatile ranges
backlight: ktd253: Stabilize backlight
PCI: of: Don't fail devm_pci_alloc_host_bridge() on missing 'ranges'
PCI: iproc: Fix BCMA probe resource handling
netfilter: Fix fall-through warnings for Clang
netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
KVM: arm64: Restrict IPA size to maximum 48 bits on 4K and 16K page size
PCI: Fix pci_dev_str_match_path() alloc while atomic bug
mfd: tqmx86: Clear GPIO IRQ resource when no IRQ is set
tracing/boot: Fix a hist trigger dependency for boot time tracing
mtd: mtdconcat: Judge callback existence based on the master
mtd: mtdconcat: Check _read, _write callbacks existence before assignment
KVM: arm64: Fix read-side race on updates to vcpu reset state
KVM: arm64: Handle PSCI resets before userspace touches vCPU state
PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n
mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()'
ARC: export clear_user_page() for modules
perf unwind: Do not overwrite FEATURE_CHECK_LDFLAGS-libunwind-{x86,aarch64}
perf bench inject-buildid: Handle writen() errors
gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()'
gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak
net: dsa: tag_rtl4_a: Fix egress tags
selftests: mptcp: clean tmp files in simult_flows
net: hso: add failure handler for add_net_device
net: dsa: b53: Fix calculating number of switch ports
net: dsa: b53: Set correct number of ports in the DSA struct
netfilter: socket: icmp6: fix use-after-scope
fq_codel: reject silly quantum parameters
qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom
ip_gre: validate csum_start only on pull
net: dsa: b53: Fix IMP port setup on BCM5301x
bnxt_en: fix stored FW_PSID version masks
bnxt_en: Fix asic.rev in devlink dev info command
bnxt_en: log firmware debug notifications
bnxt_en: Consolidate firmware reset event logging.
bnxt_en: Convert to use netif_level() helpers.
bnxt_en: Improve logging of error recovery settings information.
bnxt_en: Fix possible unintended driver initiated error recovery
mfd: lpc_sch: Partially revert "Add support for Intel Quark X1000"
mfd: lpc_sch: Rename GPIOBASE to prevent build error
net: renesas: sh_eth: Fix freeing wrong tx descriptor
x86/mce: Avoid infinite loop for copy from user recovery
bnxt_en: Fix error recovery regression
net: dsa: bcm_sf2: Fix array overrun in bcm_sf2_num_active_ports()
Linux 5.10.68
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I542f48f8de516dcabce91d3d399583483aba0da7
commit 81065b35e2486c024c7aa86caed452e1f01a59d4 upstream.
There are two cases for machine check recovery:
1) The machine check was triggered by ring3 (application) code.
This is the simpler case. The machine check handler simply queues
work to be executed on return to user. That code unmaps the page
from all users and arranges to send a SIGBUS to the task that
triggered the poison.
2) The machine check was triggered in kernel code that is covered by
an exception table entry. In this case the machine check handler
still queues a work entry to unmap the page, etc. but this will
not be called right away because the #MC handler returns to the
fix up code address in the exception table entry.
Problems occur if the kernel triggers another machine check before the
return to user processes the first queued work item.
Specifically, the work is queued using the ->mce_kill_me callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user, it enters an infinite loop processing the same entry for ever.
There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.
1) Some code (e.g. futex) first tries a get_user() with page faults
disabled. If this fails, the code retries with page faults enabled
expecting that this will resolve the page fault.
2) Copy from user code retries a copy in byte-at-time mode to check
whether any additional bytes can be copied.
On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.
Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work call back is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).
Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever enforce a limit of 10.
[ bp: Massage commit message, drop noinstr, fix typo, extend panic
messages. ]
Fixes: 5567d11c21 ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.44
proc: Track /proc/$pid/attr/ opener mm_struct
ASoC: max98088: fix ni clock divider calculation
ASoC: amd: fix for pcm_read() error
spi: Fix spi device unregister flow
spi: spi-zynq-qspi: Fix stack violation bug
bpf: Forbid trampoline attach for functions with variable arguments
net/nfc/rawsock.c: fix a permission check bug
usb: cdns3: Fix runtime PM imbalance on error
ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet
ASoC: Intel: bytcr_rt5640: Add quirk for the Lenovo Miix 3-830 tablet
vfio-ccw: Reset FSM state to IDLE inside FSM
vfio-ccw: Serialize FSM IDLE state with I/O completion
ASoC: sti-sas: add missing MODULE_DEVICE_TABLE
spi: sprd: Add missing MODULE_DEVICE_TABLE
usb: chipidea: udc: assign interrupt number to USB gadget structure
isdn: mISDN: netjet: Fix crash in nj_probe:
bonding: init notify_work earlier to avoid uninitialized use
netlink: disable IRQs for netlink_lock_table()
net: mdiobus: get rid of a BUG_ON()
cgroup: disable controllers at parse time
wq: handle VM suspension in stall detection
net/qla3xxx: fix schedule while atomic in ql_sem_spinlock
RDS tcp loopback connection can hang
net:sfc: fix non-freed irq in legacy irq mode
scsi: bnx2fc: Return failure if io_req is already in ABTS processing
scsi: vmw_pvscsi: Set correct residual data length
scsi: hisi_sas: Drop free_irq() of devm_request_irq() allocated irq
scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal
net: macb: ensure the device is available before accessing GEMGXL control registers
net: appletalk: cops: Fix data race in cops_probe1
net: dsa: microchip: enable phy errata workaround on 9567
nvme-fabrics: decode host pathing error for connect
MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER
dm verity: fix require_signatures module_param permissions
bnx2x: Fix missing error code in bnx2x_iov_init_one()
nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
nvmet: fix false keep-alive timeout when a controller is torn down
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
spi: Don't have controller clean up spi device before driver unbind
spi: Cleanup on failure of initial setup
i2c: mpc: Make use of i2c_recover_bus()
i2c: mpc: implement erratum A-004447 workaround
ALSA: seq: Fix race of snd_seq_timer_open()
ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun()
ALSA: hda/realtek: headphone and mic don't work on an Acer laptop
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Elite Dragonfly G2
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP EliteBook x360 1040 G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 840 Aero G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP ZBook Power G8
spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
Revert "ACPI: sleep: Put the FACS table after using it"
drm: Fix use-after-free read in drm_getunique()
drm: Lock pointer access in drm_master_release()
perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
KVM: X86: MMU: Use the correct inherited permissions to get shadow page
kvm: avoid speculation-based attacks from out-of-range memslot accesses
staging: rtl8723bs: Fix uninitialized variables
async_xor: check src_offs is not NULL before updating it
btrfs: return value from btrfs_mark_extent_written() in case of error
btrfs: promote debugging asserts to full-fledged checks in validate_super
cgroup1: don't allow '\n' in renaming
ftrace: Do not blindly read the ip address in ftrace_bug()
mmc: renesas_sdhi: abort tuning when timeout detected
mmc: renesas_sdhi: Fix HS400 on R-Car M3-W+
USB: f_ncm: ncm_bitrate (speed) is unsigned
usb: f_ncm: only first packet of aggregate needs to start timer
usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms
usb: dwc3-meson-g12a: fix usb2 PHY glue init when phy0 is disabled
usb: dwc3: meson-g12a: Disable the regulator in the error handling path of the probe
usb: dwc3: gadget: Bail from dwc3_gadget_exit() if dwc->gadget is NULL
usb: dwc3: ep0: fix NULL pointer exception
usb: musb: fix MUSB_QUIRK_B_DISCONNECT_99 handling
usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe()
usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource()
usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind
USB: serial: ftdi_sio: add NovaTech OrionMX product ID
USB: serial: omninet: add device id for Zyxel Omni 56K Plus
USB: serial: quatech2: fix control-request directions
USB: serial: cp210x: fix alternate function for CP2102N QFN20
usb: gadget: eem: fix wrong eem header operation
usb: fix various gadgets null ptr deref on 10gbps cabling.
usb: fix various gadget panics on 10gbps cabling
usb: typec: tcpm: cancel vdm and state machine hrtimer when unregister tcpm port
usb: typec: tcpm: cancel frs hrtimer when unregister tcpm port
regulator: core: resolve supply for boot-on/always-on regulators
regulator: max77620: Use device_set_of_node_from_dev()
regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837
regulator: fan53880: Fix missing n_voltages setting
regulator: bd71828: Fix .n_voltages settings
regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks
phy: usb: Fix misuse of IS_ENABLED
usb: dwc3: gadget: Disable gadget IRQ during pullup disable
usb: typec: mux: Fix copy-paste mistake in typec_mux_match
drm/mcde: Fix off by 10^3 in calculation
drm/msm/a6xx: fix incorrectly set uavflagprd_inv field for A650
drm/msm/a6xx: update/fix CP_PROTECT initialization
drm/msm/a6xx: avoid shadow NULL reference in failure path
RDMA/ipoib: Fix warning caused by destroying non-initial netns
RDMA/mlx4: Do not map the core_clock page to user space unless enabled
ARM: cpuidle: Avoid orphan section warning
vmlinux.lds.h: Avoid orphan section with !SMP
tools/bootconfig: Fix error return code in apply_xbc()
phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe()
ASoC: core: Fix Null-point-dereference in fmt_single_name()
ASoC: meson: gx-card: fix sound-dai dt schema
phy: ti: Fix an error code in wiz_probe()
gpio: wcd934x: Fix shift-out-of-bounds error
perf: Fix data race between pin_count increment/decrement
sched/fair: Keep load_avg and load_sum synced
sched/fair: Make sure to update tg contrib for blocked load
sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling
x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
IB/mlx5: Fix initializing CQ fragments buffer
NFS: Fix a potential NULL dereference in nfs_get_client()
NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
perf session: Correct buffer copying when peeking events
kvm: fix previous commit for 32-bit builds
NFS: Fix use-after-free in nfs4_init_client()
NFSv4: Fix second deadlock in nfs4_evict_inode()
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
scsi: core: Fix error handling of scsi_host_alloc()
scsi: core: Fix failure handling of scsi_add_host_with_dma()
scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
scsi: core: Only put parent device if host state differs from SHOST_CREATED
tracing: Correct the length check which causes memory corruption
proc: only require mm_struct for writing
Linux 5.10.44
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic64172b4e72ccb54d96000b3065dd8b33aa9fef5
commit 68d7a190682aa4eb02db477328088ebad15acc83 upstream.
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
unnecessary util_est updates uses the LSB of util_est.enqueued. It is
exposed via _task_util_est() (and task_util_est()).
Commit 92a801e5d5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
mentions that the LSB is lost for util_est resolution but
find_energy_efficient_cpu() checks if task_util_est() returns 0 to
return prev_cpu early.
_task_util_est() returns the max value of util_est.ewma and
util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
So task_util_est() returning the max of task_util() and
_task_util_est() will never return 0 under the default
SCHED_FEAT(UTIL_EST, true).
To fix this use the MSB of util_est.enqueued instead and keep the flag
util_est internal, i.e. don't export it via _task_util_est().
The maximal possible util_avg value for a task is 1024 so the MSB of
'unsigned int util_est.enqueued' isn't used to store a util value.
As a caveat the code behind the util_est_se trace point has to filter
UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
be easy to do.
This also fixes an issue report by Xuewen Yan that util_est_update()
only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Enlarge OEM data reserved in struct task_struct to support OEM's
feature.
Bug: 188018141
Change-Id: I242459e0ecf327475bb4814110f89f160cca0ad1
Signed-off-by: Liangliang Li <liliangliang@vivo.com>
Try to mitigate potential future driver core api changes by adding a
pointer to struct signal_struct, struct sched_entity, struct
sched_rt_entity, and struct task_struct.
Based on a patch from Michal Marek <mmarek@suse.cz> from the SLES kernel
Leaf changes summary: 3 artifacts changed
Changed leaf types summary: 3 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable
'struct sched_entity at sched.h:444:1' changed:
type size changed from 3584 to 4096 (in bits)
4 data member insertions:
'u64 sched_entity::android_kabi_reserved1', at offset 3584 (in bits) at sched.h:481:1
'u64 sched_entity::android_kabi_reserved2', at offset 3648 (in bits) at sched.h:482:1
'u64 sched_entity::android_kabi_reserved3', at offset 3712 (in bits) at sched.h:483:1
'u64 sched_entity::android_kabi_reserved4', at offset 3776 (in bits) at sched.h:484:1
1435 impacted interfaces:
'struct sched_rt_entity at sched.h:481:1' changed:
type size changed from 384 to 640 (in bits)
4 data member insertions:
'u64 sched_rt_entity::android_kabi_reserved1', at offset 384 (in bits) at sched.h:504:1
'u64 sched_rt_entity::android_kabi_reserved2', at offset 448 (in bits) at sched.h:505:1
'u64 sched_rt_entity::android_kabi_reserved3', at offset 512 (in bits) at sched.h:506:1
'u64 sched_rt_entity::android_kabi_reserved4', at offset 576 (in bits) at sched.h:507:1
1435 impacted interfaces:
'struct task_struct at sched.h:624:1' changed:
type size changed from 28672 to 30208 (in bits)
8 data member insertions:
'u64 task_struct::android_kabi_reserved1', at offset 20992 (in bits) at sched.h:1294:1
'u64 task_struct::android_kabi_reserved2', at offset 21056 (in bits) at sched.h:1295:1
'u64 task_struct::android_kabi_reserved3', at offset 21120 (in bits) at sched.h:1296:1
'u64 task_struct::android_kabi_reserved4', at offset 21184 (in bits) at sched.h:1297:1
'u64 task_struct::android_kabi_reserved5', at offset 21248 (in bits) at sched.h:1298:1
'u64 task_struct::android_kabi_reserved6', at offset 21312 (in bits) at sched.h:1299:1
'u64 task_struct::android_kabi_reserved7', at offset 21376 (in bits) at sched.h:1300:1
'u64 task_struct::android_kabi_reserved8', at offset 21440 (in bits) at sched.h:1301:1
there are data member changes:
1435 impacted interfaces:
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1449735b836399e9b356608727092334daf5c36b
The current approach to carry the wake_q length is exposed to an
intertask stack access. For example, if A sets the wake_q_head for
B but is preempted before it is able to set it back to NULL,
then B continues to point to an address corresponding to A's stack.
If B is then woken up by another task, it ends up accessing
the address pointing to A's stack. This causes a memory fault.
Replace this with a simple parameter which indicates the number
of tasks that are being woken up as part of the same event. This
avoids saving and accessing on stack pointers.
Bug: 173981591
Change-Id: I0031747d79a27673e680f7b1121eb4896ac7c699
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask.
In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
force_compatible_cpus_allowed_ptr(), which restricts the affinity mask
for a task to contain only compatible CPUs.
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Bug: 178507149
Link: https://lore.kernel.org/linux-arch/20201208132835.6151-11-will@kernel.org/
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ief3e47f6aa8179eadf2e009f207cc0161b76f466
Change vendor fields to OEM fields by using
ANDROID_OEM_DATA when the vendor fields were
originally requested by an OEM.
Bug: 177481081
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: Iaf1e80e07bfd78efb5a9b7ff5894ff751f272f23
paused_cpus intending to force CPUs to go idle as quickly as possible,
adding a migration step, to drain the rq from any running task.
Two steps are actually needed. The first one, "lazy", will run before the
cpu_active_mask has been synced. The second one will run after. It is
possible for another CPU, to observe an outdated version of that mask and
to enqueue a task on a rq that has just been marked inactive. The second
migration is there to catch any of those spurious move, while the first
one will drain the rq as quickly as possible to let the CPU reach an idle
state.
Bug: 161210528
Change-Id: Ie26c2e4c42665dd61d41a899a84536e56bf2b887
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Some partners have value-adds based on aosp/540066, which cannot be
carried in ACK in its entirety as it no longer makes sense as-is (the
select_idle_capacity() rework upstream solved the issue differently).
It seems that those partners do not actually need the wake-wide tweaks,
they only need to access the wake_q length for wake-up balance. To
support this, add minimal tracking to the wake_q infrastructure in the
core kernel, but do that by adding a pointer to the wake_q_head to
task_struct directly to not litter all sched classes with an additional
sibling_count_hint argument to the select_task_rq callbacks.
Modules needing to access the wake_q length can do so by dereferencing
p->wake_q_head in the wake-up path when it is non-NULL.
Bug: 173981591
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I9a98167face92e70aba847d9f04d0c216065478c
Pull scheduler fixes from Thomas Gleixner:
"A couple of scheduler fixes:
- Make the conditional update of the overutilized state work
correctly by caching the relevant flags state before overwriting
them and checking them afterwards.
- Fix a data race in the wakeup path which caused loadavg on ARM64
platforms to become a random number generator.
- Fix the ordering of the iowaiter accounting operations so it can't
be decremented before it is incremented.
- Fix a bug in the deadline scheduler vs. priority inheritance when a
non-deadline task A has inherited the parameters of a deadline task
B and then blocks on a non-deadline task C.
The second inheritance step used the static deadline parameters of
task A, which are usually 0, instead of further propagating task
B's parameters. The zero initialized parameters trigger a bug in
the deadline scheduler"
* tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix priority inheritance with multiple scheduling classes
sched: Fix rq->nr_iowait ordering
sched: Fix data-race in wakeup
sched/fair: Fix overutilized update in enqueue_task_fair()
Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).
------------[ cut here ]------------
kernel BUG at kernel/sched/deadline.c:1462!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
Hardware name: ...
RIP: 0010:enqueue_task_dl+0x335/0x910
Code: ...
RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
? intel_pstate_update_util_hwp+0x13/0x170
rt_mutex_setprio+0x1cc/0x4b0
task_blocks_on_rt_mutex+0x225/0x260
rt_spin_lock_slowlock_locked+0xab/0x2d0
rt_spin_lock_slowlock+0x50/0x80
hrtimer_grab_expiry_lock+0x20/0x30
hrtimer_cancel+0x13/0x30
do_nanosleep+0xa0/0x150
hrtimer_nanosleep+0xe1/0x230
? __hrtimer_init_sleeper+0x60/0x60
__x64_sys_nanosleep+0x8d/0xa0
do_syscall_64+0x4a/0x100
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa58b52330d
...
---[ end trace 0000000000000002 ]—
He also provided a simple reproducer creating the situation below:
So the execution order of locking steps are the following
(N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
are mutexes that are enabled * with priority inheritance.)
Time moves forward as this timeline goes down:
N1 N2 D1
| | |
| | |
Lock(M1) | |
| | |
| Lock(M2) |
| | |
| | Lock(M2)
| | |
| Lock(M1) |
| (!!bug triggered!) |
Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).
Problem is that boosted entities (Priority Inheritance) use static
DEADLINE parameters of the top priority waiter. However, there might be
cases where top waiter could be a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, top waiter static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().
Fix this by keeping track of the original donor and using its parameters
when a task is boosted.
Reported-by: Glenn Elliott <glenn@aurora.tech>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
Mel reported that on some ARM64 platforms loadavg goes bananas and
Will tracked it down to the following race:
CPU0 CPU1
schedule()
prev->sched_contributes_to_load = X;
deactivate_task(prev);
try_to_wake_up()
if (p->on_rq &&) // false
if (smp_load_acquire(&p->on_cpu) && // true
ttwu_queue_wakelist())
p->sched_remote_wakeup = Y;
smp_store_release(prev->on_cpu, 0);
where both p->sched_contributes_to_load and p->sched_remote_wakeup are
in the same word, and thus the stores X and Y race (and can clobber
one another's data).
Whereas prior to commit c6e7bd7afa ("sched/core: Optimize ttwu()
spinning on p->on_cpu") the p->on_cpu handoff serialized access to
p->sched_remote_wakeup (just as it still does with
p->sched_contributes_to_load) that commit broke that by calling
ttwu_queue_wakelist() with p->on_cpu != 0.
However, due to
p->XXX = X ttwu()
schedule() if (p->on_rq && ...) // false
smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) &&
deactivate_task() ttwu_queue_wakelist())
p->on_rq = 0; p->sched_remote_wakeup = Y;
We can be sure any 'current' store is complete and 'current' is
guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
the current flags word.
Note: while the observed failure was loadavg accounting gone wrong due
to ttwu() cobbering p->sched_contributes_to_load, the reverse problem
is also possible where schedule() clobbers p->sched_remote_wakeup,
this could result in enqueue_entity() wrecking ->vruntime and causing
scheduling artifacts.
Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Debugged-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
Defer the softirq processing to ksoftirqd if a RT task is running
or queued on the current CPU. This complements the RT task placement
algorithm which tries to find a CPU that is not currently busy with
softirqs.
Currently NET_TX, NET_RX, BLOCK and TASKLET softirqs are only deferred
as they can potentially run for long time.
Bug: 168521633
Change-Id: Id7665244af6bbd5a96d9e591cf26154e9eaa860c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[satyap@codeaurora.org: trivial merge conflict resolution.]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
[elavila: Port to mainline, squash with bugfix]
Signed-off-by: J. Avila <elavila@google.com>
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/userfaultfd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
Pull io_uring updates from Jens Axboe:
- Add blkcg accounting for io-wq offload (Dennis)
- A use-after-free fix for io-wq (Hillf)
- Cancelation fixes and improvements
- Use proper files_struct references for offload
- Cleanup of io_uring_get_socket() since that can now go into our own
header
- SQPOLL fixes and cleanups, and support for sharing the thread
- Improvement to how page accounting is done for registered buffers and
huge pages, accounting the real pinned state
- Series cleaning up the xarray code (Willy)
- Various cleanups, refactoring, and improvements (Pavel)
- Use raw spinlock for io-wq (Sebastian)
- Add support for ring restrictions (Stefano)
* tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits)
io_uring: keep a pointer ref_node in file_data
io_uring: refactor *files_register()'s error paths
io_uring: clean file_data access in files_register
io_uring: don't delay io_init_req() error check
io_uring: clean leftovers after splitting issue
io_uring: remove timeout.list after hrtimer cancel
io_uring: use a separate struct for timeout_remove
io_uring: improve submit_state.ios_left accounting
io_uring: simplify io_file_get()
io_uring: kill extra check in fixed io_file_get()
io_uring: clean up ->files grabbing
io_uring: don't io_prep_async_work() linked reqs
io_uring: Convert advanced XArray uses to the normal API
io_uring: Fix XArray usage in io_uring_add_task_file
io_uring: Fix use of XArray in __io_uring_files_cancel
io_uring: fix break condition for __io_uring_register() waiting
io_uring: no need to call xa_destroy() on empty xarray
io_uring: batch account ->req_issue and task struct references
io_uring: kill callback_head argument for io_req_task_work_add()
io_uring: move req preps out of io_issue_sqe()
...
Pull scheduler updates from Ingo Molnar:
- reorganize & clean up the SD* flags definitions and add a bunch of
sanity checks. These new checks caught quite a few bugs or at least
inconsistencies, resulting in another set of patches.
- rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
- add a new tracepoint to improve CPU capacity tracking
- improve overloaded SMP system load-balancing behavior
- tweak SMT balancing
- energy-aware scheduling updates
- NUMA balancing improvements
- deadline scheduler fixes and improvements
- CPU isolation fixes
- misc cleanups, simplifications and smaller optimizations
* tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
sched/deadline: Unthrottle PI boosted threads while enqueuing
sched/debug: Add new tracepoint to track cpu_capacity
sched/fair: Tweak pick_next_entity()
rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
rseq/selftests,x86_64: Add rseq_offset_deref_addv()
rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
sched/fair: Use dst group while checking imbalance for NUMA balancer
sched/fair: Reduce busy load balance interval
sched/fair: Minimize concurrent LBs between domain level
sched/fair: Reduce minimal imbalance threshold
sched/fair: Relax constraint on task's load during load balance
sched/fair: Remove the force parameter of update_tg_load_avg()
sched/fair: Fix wrong cpu selecting from isolated domain
sched: Remove unused inline function uclamp_bucket_base_value()
sched/rt: Disable RT_RUNTIME_SHARE by default
sched/deadline: Fix stale throttling on de-/boosted tasks
sched/numa: Use runnable_avg to classify node
sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
...
Pull RAS updates from Borislav Petkov:
- Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory
by sending them a SIGBUS on return to user space and umapping the
faulty memory, by Tony Luck and Youquan Song.
- memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
- New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
- Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the
hw eval phase and they don't make it into production.
- Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
x86/mce: Decode a kernel instruction to determine if it is copying from user
x86/mce: Recover from poison found while copying from user space
x86/mce: Avoid tail copy when machine check terminated a copy from user
x86/mce: Add _ASM_EXTABLE_CPY for copy user access
x86/mce: Provide method to find out the type of an exception handler
x86/mce: Pass pointer to saved pt_regs to severity calculation routines
x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
x86/mce: Add Skylake quirk for patrol scrub reported errors
RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
x86/mce: Annotate mce_rd/wrmsrl() with noinstr
x86/mce/dev-mcelog: Do not update kflags on AMD systems
x86/mce: Stop mce_reign() from re-computing severity for every CPU
x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
x86/mce: Increase maximum number of banks to 64
x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
RAS/CEC: Fix cec_init() prototype
Existing kernel code can only recover from a machine check on code that
is tagged in the exception table with a fault handling recovery path.
Add two new fields in the task structure to pass information from
machine check handler to the "task_work" that is queued to run before
the task returns to user mode:
+ mce_vaddr: will be initialized to the user virtual address of the fault
in the case where the fault occurred in the kernel copying data from
a user address. This is so that kill_me_maybe() can provide that
information to the user SIGBUS handler.
+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
to determine if mce_vaddr is applicable to this error.
Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the task.
Use a new helper function to share common code between the "fault
in user mode" case and the "fault while copying from user" case.
New code paths will be activated by the next patch which sets
MCE_IN_KERNEL_COPYIN.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work() which is considered to be a hot path.
If the two bits cross the 8 or 16 bit boundary then most architecture
require multiple load instructions in order to create the constant
value. Also, such a value can not be encoded within the compare opcode.
By moving the bit definition within the same block, the compiler can
create/use one immediate value.
For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".
Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819195505.y3fxk72sotnrkczi@linutronix.de
- Add the hook to get mutex/rwsem information that the tasks
are waiting for.
- Add the hook to print messages for sched_show_task.
- ANDROID_VENDOR_DATA_ARRAY added to task_struct
Bug: 162776704
Signed-off-by: Sangmoon Kim <sangmoon.kim@samsung.com>
Change-Id: Ib436fbd8d0ad509c3b5a73ea8f5170e0761a13fd
(cherry picked from commit b519ac4237)
Pull more timer updates from Thomas Gleixner:
"A set of posix CPU timer changes which allows to defer the heavy work
of posix CPU timers into task work context. The tick interrupt is
reduced to a quick check which queues the work which is doing the
heavy lifting before returning to user space or going back to guest
mode. Moving this out is deferring the signal delivery slightly but
posix CPU timers are inaccurate by nature as they depend on the tick
so there is no real damage. The relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context
tick handler, but it also has the general benefit that the actual
heavy work is accounted to the task/process and not to the tick
interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of
posix CPU timers (which are unpriviledged) is armed for a
task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which
was just established for x86 with the new common entry/exit code which
got merged post 5.8 and is not the case for other KVM architectures"
* tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Select POSIX_CPU_TIMERS_TASK_WORK
posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
posix-cpu-timers: Split run_posix_cpu_timers()
Tiny steps on the way to 5.9-rc1.
Fixes conflicts in:
fs/f2fs/inline.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I16d863ae44a51156499458e8c3486587cbe2babe
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
Steps on the way to 5.9-rc1
Resolves conflicts in:
drivers/irqchip/qcom-pdc.c
include/linux/device.h
net/xfrm/xfrm_state.c
security/lsm_audit.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4aeb3d04f4717714a421721eb3ce690c099bb30a
Pull sched/fifo updates from Ingo Molnar:
"This adds the sched_set_fifo*() encapsulation APIs to remove static
priority level knowledge from non-scheduler code.
The three APIs for non-scheduler code to set SCHED_FIFO are:
- sched_set_fifo()
- sched_set_fifo_low()
- sched_set_normal()
These are two FIFO priority levels: default (high), and a 'low'
priority level, plus sched_set_normal() to set the policy back to
non-SCHED_FIFO.
Since the changes affect a lot of non-scheduler code, we kept this in
a separate tree"
* tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched,tracing: Convert to sched_set_fifo()
sched: Remove sched_set_*() return value
sched: Remove sched_setscheduler*() EXPORTs
sched,psi: Convert to sched_set_fifo_low()
sched,rcutorture: Convert to sched_set_fifo_low()
sched,rcuperf: Convert to sched_set_fifo_low()
sched,locktorture: Convert to sched_set_fifo()
sched,irq: Convert to sched_set_fifo()
sched,watchdog: Convert to sched_set_fifo()
sched,serial: Convert to sched_set_fifo()
sched,powerclamp: Convert to sched_set_fifo()
sched,ion: Convert to sched_set_normal()
sched,powercap: Convert to sched_set_fifo*()
sched,spi: Convert to sched_set_fifo*()
sched,mmc: Convert to sched_set_fifo*()
sched,ivtv: Convert to sched_set_fifo*()
sched,drm/scheduler: Convert to sched_set_fifo*()
sched,msm: Convert to sched_set_fifo*()
sched,psci: Convert to sched_set_fifo*()
sched,drbd: Convert to sched_set_fifo*()
...