Changes in 5.10.55
tools: Allow proper CC/CXX/... override with LLVM=1 in Makefile.include
io_uring: fix link timeout refs
KVM: x86: determine if an exception has an error code only when injecting it.
af_unix: fix garbage collect vs MSG_PEEK
workqueue: fix UAF in pwq_unbound_release_workfn()
cgroup1: fix leaked context root causing sporadic NULL deref in LTP
net/802/mrp: fix memleak in mrp_request_join()
net/802/garp: fix memleak in garp_request_join()
net: annotate data race around sk_ll_usec
sctp: move 198 addresses from unusable to private scope
rcu-tasks: Don't delete holdouts within trc_inspect_reader()
rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
ipv6: allocate enough headroom in ip6_finish_output2()
drm/ttm: add a check against null pointer dereference
hfs: add missing clean-up in hfs_fill_super
hfs: fix high memory mapping in hfs_bnode_read
hfs: add lock nesting notation to hfs_find_init
firmware: arm_scmi: Fix possible scmi_linux_errmap buffer overflow
firmware: arm_scmi: Fix range check for the maximum number of pending messages
cifs: fix the out of range assignment to bit fields in parse_server_interfaces
iomap: remove the length variable in iomap_seek_data
iomap: remove the length variable in iomap_seek_hole
ARM: dts: versatile: Fix up interrupt controller node names
ipv6: ip6_finish_output2: set sk into newly allocated nskb
Linux 5.10.55
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2d673bdde784b3689af73289305091dbd4ead042
commit 1e7107c5ef44431bc1ebbd4c353f1d7c22e5f2ec upstream.
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests. Things like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:kernfs_sop_show_path+0x1b/0x60
...or these others:
RIP: 0010:do_mkdirat+0x6a/0xf0
RIP: 0010:d_alloc_parallel+0x98/0x510
RIP: 0010:do_readlinkat+0x86/0x120
There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference. I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.
In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:
--------------
struct cgroup_subsys *ss;
- struct dentry *dentry;
[...]
- dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
- CGROUP_SUPER_MAGIC, ns);
[...]
- if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
- struct super_block *sb = dentry->d_sb;
- dput(dentry);
+ ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+ if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+ struct super_block *sb = fc->root->d_sb;
+ dput(fc->root);
deactivate_locked_super(sb);
msleep(10);
return restart_syscall();
}
--------------
In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.
A fix would be to not leave the stale reference in fc->root as follows:
--------------
dput(fc->root);
+ fc->root = NULL;
deactivate_locked_super(sb);
--------------
...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
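The stale-pointer problem described above can be sketched in user space. This is a minimal illustration, not the kernel code: toy_dentry, toy_fs_context, toy_dput(), toy_fc_drop() and toy_put_fs_context() are hypothetical stand-ins, and the superblock handling that the real fc_drop_locked() also performs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct dentry and struct fs_context. */
struct toy_dentry {
	int refcount;
};

struct toy_fs_context {
	struct toy_dentry *root;	/* owned reference, or NULL */
};

/* Allocate a dentry holding a single reference. */
static struct toy_dentry *toy_dentry_new(void)
{
	struct toy_dentry *d = malloc(sizeof(*d));

	if (d)
		d->refcount = 1;
	return d;
}

/* Drop one reference; free the object when the count reaches zero. */
static void toy_dput(struct toy_dentry *d)
{
	if (d && --d->refcount == 0)
		free(d);
}

/*
 * Release the reference and clear the pointer in one step, so that no
 * stale pointer is left behind in the context (the idea behind using
 * fc_drop_locked() instead of a bare dput()).
 */
static void toy_fc_drop(struct toy_fs_context *fc)
{
	struct toy_dentry *d = fc->root;

	fc->root = NULL;
	toy_dput(d);
}

/* Context teardown: only drops fc->root if a reference is still held. */
static void toy_put_fs_context(struct toy_fs_context *fc)
{
	if (fc->root)
		toy_fc_drop(fc);
}
```

With a bare toy_dput(fc->root) in place of toy_fc_drop(), the later teardown would see a non-NULL fc->root and drop the already-released reference a second time, which is the failure mode the commit removes.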
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.52
certs: add 'x509_revocation_list' to gitignore
cifs: handle reconnect of tcon when there is no cached dfs referral
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
scsi: core: Fix bad pointer dereference when ehandler kthread is invalid
scsi: zfcp: Report port fc_security as unknown early during remote cable pull
tracing: Do not reference char * as a string in histograms
drm/i915/gtt: drop the page table optimisation
drm/i915/gt: Fix -EDEADLK handling regression
cgroup: verify that source is a string
fbmem: Do not delete the mode that is still in use
drm/dp_mst: Do not set proposed vcpi directly
drm/dp_mst: Avoid to mess up payload table by ports in stale topology
drm/dp_mst: Add missing drm parameters to recently added call to drm_dbg_kms()
drm/ingenic: Fix non-OSD mode
drm/ingenic: Switch IPU plane to type OVERLAY
Revert "drm/ast: Remove reference to struct drm_device.pdev"
net: bridge: multicast: fix PIM hello router port marking race
net: bridge: multicast: fix MRD advertisement router port marking race
leds: tlc591xx: fix return value check in tlc591xx_probe()
ASoC: Intel: sof_sdw: add mutual exclusion between PCH DMIC and RT715
dmaengine: fsl-qdma: check dma_set_mask return value
scsi: arcmsr: Fix the wrong CDB payload report to IOP
srcu: Fix broken node geometry after early ssp init
rcu: Reject RCU_LOCKDEP_WARN() false positives
tty: serial: fsl_lpuart: fix the potential risk of division or modulo by zero
serial: fsl_lpuart: disable DMA for console and fix sysrq
misc/libmasm/module: Fix two use after free in ibmasm_init_one
misc: alcor_pci: fix null-ptr-deref when there is no PCI bridge
ASoC: intel/boards: add missing MODULE_DEVICE_TABLE
partitions: msdos: fix one-byte get_unaligned()
iio: gyro: fxa21002c: Balance runtime pm + use pm_runtime_resume_and_get().
iio: magn: bmc150: Balance runtime pm + use pm_runtime_resume_and_get()
ALSA: usx2y: Avoid camelCase
ALSA: usx2y: Don't call free_pages_exact() with NULL address
Revert "ALSA: bebob/oxfw: fix Kconfig entry for Mackie d.2 Pro"
usb: common: usb-conn-gpio: fix NULL pointer dereference of charger
w1: ds2438: fixing bug that would always get page0
scsi: arcmsr: Fix doorbell status being updated late on ARC-1886
scsi: hisi_sas: Propagate errors in interrupt_init_v1_hw()
scsi: lpfc: Fix "Unexpected timeout" error in direct attach topology
scsi: lpfc: Fix crash when lpfc_sli4_hba_setup() fails to initialize the SGLs
scsi: core: Cap scsi_host cmd_per_lun at can_queue
ALSA: ac97: fix PM reference leak in ac97_bus_remove()
tty: serial: 8250: serial_cs: Fix a memory leak in error handling path
scsi: mpt3sas: Fix deadlock while cancelling the running firmware event
scsi: core: Fixup calling convention for scsi_mode_sense()
scsi: scsi_dh_alua: Check for negative result value
fs/jfs: Fix missing error code in lmLogInit()
scsi: megaraid_sas: Fix resource leak in case of probe failure
scsi: megaraid_sas: Early detection of VD deletion through RaidMap update
scsi: megaraid_sas: Handle missing interrupts while re-enabling IRQs
scsi: iscsi: Add iscsi_cls_conn refcount helpers
scsi: iscsi: Fix conn use after free during resets
scsi: iscsi: Fix shost->max_id use
scsi: qedi: Fix null ref during abort handling
scsi: qedi: Fix race during abort timeouts
scsi: qedi: Fix TMF session block/unblock use
scsi: qedi: Fix cleanup session block/unblock use
mfd: da9052/stmpe: Add and modify MODULE_DEVICE_TABLE
mfd: cpcap: Fix cpcap dmamask not set warnings
ASoC: img: Fix PM reference leak in img_i2s_in_probe()
fsi: Add missing MODULE_DEVICE_TABLE
serial: tty: uartlite: fix console setup
s390/sclp_vt220: fix console name to match device
s390: disable SSP when needed
selftests: timers: rtcpie: skip test if default RTC device does not exist
ALSA: sb: Fix potential double-free of CSP mixer elements
powerpc/ps3: Add dma_mask to ps3_dma_region
iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails
iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation
ASoC: soc-pcm: fix the return value in dpcm_apply_symmetry()
gpio: zynq: Check return value of pm_runtime_get_sync
gpio: zynq: Check return value of irq_get_irq_data
scsi: storvsc: Correctly handle multiple flags in srb_status
ALSA: ppc: fix error return code in snd_pmac_probe()
selftests/powerpc: Fix "no_handler" EBB selftest
gpio: pca953x: Add support for the On Semi pca9655
powerpc/mm/book3s64: Fix possible build error
ASoC: soc-core: Fix the error return code in snd_soc_of_parse_audio_routing()
habanalabs/gaudi: set the correct cpu_id on MME2_QM failure
habanalabs: remove node from list before freeing the node
s390/processor: always inline stap() and __load_psw_mask()
s390/ipl_parm: fix program check new psw handling
s390/mem_detect: fix diag260() program check new psw handling
s390/mem_detect: fix tprot() program check new psw handling
Input: hideep - fix the uninitialized use in hideep_nvm_unlock()
ALSA: bebob: add support for ToneWeal FW66
ALSA: usb-audio: scarlett2: Fix 18i8 Gen 2 PCM Input count
ALSA: usb-audio: scarlett2: Fix data_mutex lock
ALSA: usb-audio: scarlett2: Fix scarlett2_*_ctl_put() return values
usb: gadget: f_hid: fix endianness issue with descriptors
usb: gadget: hid: fix error return code in hid_bind()
powerpc/boot: Fixup device-tree on little endian
ASoC: Intel: kbl_da7219_max98357a: shrink platform_id below 20 characters
backlight: lm3630a: Fix return code of .update_status() callback
ALSA: hda: Add IRQ check for platform_get_irq()
ALSA: usb-audio: scarlett2: Fix 6i6 Gen 2 line out descriptions
ALSA: firewire-motu: fix detection for S/PDIF source on optical interface in v2 protocol
leds: turris-omnia: add missing MODULE_DEVICE_TABLE
staging: rtl8723bs: fix macro value for 2.4Ghz only device
intel_th: Wait until port is in reset before programming it
i2c: core: Disable client irq on reboot/shutdown
phy: intel: Fix for warnings due to EMMC clock 175Mhz change in FIP
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
power: supply: sc27xx: Add missing MODULE_DEVICE_TABLE
power: supply: sc2731_charger: Add missing MODULE_DEVICE_TABLE
pwm: spear: Don't modify HW state in .remove callback
PCI: ftpci100: Rename macro name collision
power: supply: ab8500: Avoid NULL pointers
PCI: hv: Fix a race condition when removing the device
power: supply: max17042: Do not enforce (incorrect) interrupt trigger type
power: reset: gpio-poweroff: add missing MODULE_DEVICE_TABLE
ARM: 9087/1: kprobes: test-thumb: fix for LLVM_IAS=1
PCI/P2PDMA: Avoid pci_get_slot(), which may sleep
NFSv4: Fix delegation return in cases where we have to retry
PCI: pciehp: Ignore Link Down/Up caused by DPC
watchdog: Fix possible use-after-free in wdt_startup()
watchdog: sc520_wdt: Fix possible use-after-free in wdt_turnoff()
watchdog: Fix possible use-after-free by calling del_timer_sync()
watchdog: imx_sc_wdt: fix pretimeout
watchdog: iTCO_wdt: Account for rebooting on second timeout
x86/fpu: Return proper error codes from user access functions
remoteproc: core: Fix cdev remove and rproc del
PCI: tegra: Add missing MODULE_DEVICE_TABLE
orangefs: fix orangefs df output.
ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
drm/gma500: Add the missed drm_gem_object_put() in psb_user_framebuffer_create()
NFS: nfs_find_open_context() may only select open files
power: supply: charger-manager: add missing MODULE_DEVICE_TABLE
power: supply: ab8500: add missing MODULE_DEVICE_TABLE
drm/amdkfd: fix sysfs kobj leak
pwm: img: Fix PM reference leak in img_pwm_enable()
pwm: tegra: Don't modify HW state in .remove callback
ACPI: AMBA: Fix resource name in /proc/iomem
ACPI: video: Add quirk for the Dell Vostro 3350
PCI: rockchip: Register IRQ handlers after device and data are ready
virtio-blk: Fix memory leak among suspend/resume procedure
virtio_net: Fix error handling in virtnet_restore()
virtio_console: Assure used length from device is limited
f2fs: atgc: fix to set default age threshold
NFSD: Fix TP_printk() format specifier in nfsd_clid_class
x86/signal: Detect and prevent an alternate signal stack overflow
f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
f2fs: compress: fix to disallow temp extension
remoteproc: k3-r5: Fix an error message
PCI/sysfs: Fix dsm_label_utf16s_to_utf8s() buffer overrun
power: supply: rt5033_battery: Fix device tree enumeration
NFSv4: Initialise connection to the server in nfs4_alloc_client()
NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
misc: alcor_pci: fix inverted branch condition
um: fix error return code in slip_open()
um: fix error return code in winch_tramp()
ubifs: Fix off-by-one error
ubifs: journal: Fix error return code in ubifs_jnl_write_inode()
watchdog: aspeed: fix hardware timeout calculation
watchdog: jz4740: Fix return value check in jz4740_wdt_probe()
SUNRPC: prevent port reuse on transports which don't request it.
nfs: fix acl memory leak of posix_acl_create()
ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode
PCI: iproc: Fix multi-MSI base vector number allocation
PCI: iproc: Support multi-MSI only on uniprocessor kernel
f2fs: fix to avoid adding tab before doc section
x86/fpu: Fix copy_xstate_to_kernel() gap handling
x86/fpu: Limit xstate copy size in xstateregs_set()
PCI: intel-gw: Fix INTx enable
pwm: imx1: Don't disable clocks at device remove time
PCI: tegra194: Fix tegra_pcie_ep_raise_msi_irq() ill-defined shift
vdpa/mlx5: Fix umem sizes assignments on VQ create
vdpa/mlx5: Fix possible failure in umem size calculation
virtio_net: move tx vq operation under tx queue lock
nvme-tcp: can't set sk_user_data without write_lock
nfsd: Reduce contention for the nfsd_file nf_rwsem
ALSA: isa: Fix error return code in snd_cmi8330_probe()
vdpa/mlx5: Clear vq ready indication upon device reset
NFSv4/pnfs: Fix the layout barrier update
NFSv4/pnfs: Fix layoutget behaviour after invalidation
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
hexagon: handle {,SOFT}IRQENTRY_TEXT in linker script
hexagon: use common DISCARDS macro
ARM: dts: gemini-rut1xx: remove duplicate ethernet node
reset: RESET_BRCMSTB_RESCAL should depend on ARCH_BRCMSTB
reset: RESET_INTEL_GW should depend on X86
reset: a10sr: add missing of_match_table reference
ARM: exynos: add missing of_node_put for loop iteration
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU/XU3
ARM: dts: exynos: fix PWM LED max brightness on Odroid HC1
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU4
memory: stm32-fmc2-ebi: add missing of_node_put for loop iteration
memory: atmel-ebi: add missing of_node_put for loop iteration
reset: brcmstb: Add missing MODULE_DEVICE_TABLE
memory: pl353: Fix error return code in pl353_smc_probe()
ARM: dts: sun8i: h3: orangepi-plus: Fix ethernet phy-mode
rtc: fix snprintf() checking in is_rtc_hctosys()
arm64: dts: renesas: v3msk: Fix memory size
ARM: dts: r8a7779, marzen: Fix DU clock names
arm64: dts: ti: j7200-main: Enable USB2 PHY RX sensitivity workaround
arm64: dts: renesas: Add missing opp-suspend properties
arm64: dts: renesas: r8a7796[01]: Fix OPP table entry voltages
ARM: dts: stm32: Connect PHY IRQ line on DH STM32MP1 SoM
ARM: dts: stm32: Rework LAN8710Ai PHY reset on DHCOM SoM
arm64: dts: qcom: trogdor: Add no-hpd to DSI bridge node
firmware: tegra: Fix error return code in tegra210_bpmp_init()
firmware: arm_scmi: Reset Rx buffer to max size during async commands
dt-bindings: i2c: at91: fix example for scl-gpios
ARM: dts: BCM5301X: Fixup SPI binding
reset: bail if try_module_get() fails
arm64: dts: renesas: r8a779a0: Drop power-domains property from GIC node
arm64: dts: ti: k3-j721e-main: Fix external refclk input to SERDES
memory: fsl_ifc: fix leak of IO mapping on probe failure
memory: fsl_ifc: fix leak of private memory on probe failure
arm64: dts: allwinner: a64-sopine-baseboard: change RGMII mode to TXID
ARM: dts: dra7: Fix duplicate USB4 target module node
ARM: dts: am335x: align ti,pindir-d0-out-d1-in property with dt-shema
ARM: dts: am437x: align ti,pindir-d0-out-d1-in property with dt-shema
thermal/drivers/sprd: Add missing MODULE_DEVICE_TABLE
ARM: dts: imx6q-dhcom: Fix ethernet reset time properties
ARM: dts: imx6q-dhcom: Fix ethernet plugin detection problems
ARM: dts: imx6q-dhcom: Add gpios pinctrl for i2c bus recovery
thermal/drivers/rcar_gen3_thermal: Fix coefficient calculations
firmware: turris-mox-rwtm: fix reply status decoding function
firmware: turris-mox-rwtm: report failures better
firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng
firmware: turris-mox-rwtm: show message about HWRNG registration
arm64: dts: rockchip: Re-add regulator-boot-on, regulator-always-on for vdd_gpu on rk3399-roc-pc
arm64: dts: rockchip: Re-add regulator-always-on for vcc_sdio for rk3399-roc-pc
scsi: be2iscsi: Fix an error handling path in beiscsi_dev_probe()
sched/uclamp: Ignore max aggregation if rq is idle
jump_label: Fix jump_label_text_reserved() vs __init
static_call: Fix static_call_text_reserved() vs __init
mips: always link byteswap helpers into decompressor
mips: disable branch profiling in boot/decompress.o
MIPS: vdso: Invalid GIC access through VDSO
scsi: scsi_dh_alua: Fix signedness bug in alua_rtpg()
seq_file: disallow extremely large seq buffer allocations
Linux 5.10.52
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic1b04661728db8b0e060ca6935783e15a22210da
commit 3b0462726e7ef281c35a7a4ae33e93ee2bc9975b upstream.
The following sequence can be used to trigger a UAF:
int fscontext_fd = fsopen("cgroup", 0);
int fd_null = open("/dev/null", O_RDONLY);
fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", NULL, fd_null);
close_range(3, ~0U, 0);
The cgroup v1 specific fs parser expects a string for the "source"
parameter. However, it is perfectly legitimate to e.g. specify a file
descriptor for the "source" parameter. The fs parser doesn't know what
a filesystem allows there. So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.
This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value. Access to that union is
guarded by the param->type member. The cgroup parameter parser didn't
check param->type but unconditionally moved param->string into
fc->source. Closing the fscontext_fd then runs put_fs_context(), which
frees fc->source and thereby frees the file stashed in param->file,
causing a UAF during the later close of fd_null.
Fix this by verifying that param->type is actually a string and report
an error if not.
In follow-up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it here first makes backporting it to stable a lot
easier.
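The type check described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel patch: toy_param, toy_context and toy_parse_source() are hypothetical names that only mirror the tagged-union layout the message describes, and the "already set" check is an extra guard added for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag, mirroring fs_value_is_string / fs_value_is_file. */
enum toy_value_type { TOY_VALUE_IS_STRING, TOY_VALUE_IS_FILE };

struct toy_file {
	int fd;
};

/* The union is only valid for the member selected by 'type'. */
struct toy_param {
	enum toy_value_type type;
	union {
		char *string;
		struct toy_file *file;
	};
};

struct toy_context {
	char *source;	/* owned string, released with free() at teardown */
};

/*
 * Take ownership of the "source" value only if it really is a string.
 * Without the type check, a file pointer stored in the union would be
 * treated as a heap string and later released with the wrong function.
 */
static int toy_parse_source(struct toy_context *ctx, struct toy_param *param)
{
	if (param->type != TOY_VALUE_IS_STRING)
		return -EINVAL;
	if (ctx->source)
		return -EINVAL;	/* sketch-only guard: source already set */

	ctx->source = param->string;
	param->string = NULL;	/* ownership moved to the context */
	return 0;
}
```

A caller that passes a file-typed parameter gets -EINVAL back and keeps ownership of the file, so the context never ends up freeing memory it does not own.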
Fixes: 8d2451f499 ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.44
proc: Track /proc/$pid/attr/ opener mm_struct
ASoC: max98088: fix ni clock divider calculation
ASoC: amd: fix for pcm_read() error
spi: Fix spi device unregister flow
spi: spi-zynq-qspi: Fix stack violation bug
bpf: Forbid trampoline attach for functions with variable arguments
net/nfc/rawsock.c: fix a permission check bug
usb: cdns3: Fix runtime PM imbalance on error
ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet
ASoC: Intel: bytcr_rt5640: Add quirk for the Lenovo Miix 3-830 tablet
vfio-ccw: Reset FSM state to IDLE inside FSM
vfio-ccw: Serialize FSM IDLE state with I/O completion
ASoC: sti-sas: add missing MODULE_DEVICE_TABLE
spi: sprd: Add missing MODULE_DEVICE_TABLE
usb: chipidea: udc: assign interrupt number to USB gadget structure
isdn: mISDN: netjet: Fix crash in nj_probe:
bonding: init notify_work earlier to avoid uninitialized use
netlink: disable IRQs for netlink_lock_table()
net: mdiobus: get rid of a BUG_ON()
cgroup: disable controllers at parse time
wq: handle VM suspension in stall detection
net/qla3xxx: fix schedule while atomic in ql_sem_spinlock
RDS tcp loopback connection can hang
net:sfc: fix non-freed irq in legacy irq mode
scsi: bnx2fc: Return failure if io_req is already in ABTS processing
scsi: vmw_pvscsi: Set correct residual data length
scsi: hisi_sas: Drop free_irq() of devm_request_irq() allocated irq
scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal
net: macb: ensure the device is available before accessing GEMGXL control registers
net: appletalk: cops: Fix data race in cops_probe1
net: dsa: microchip: enable phy errata workaround on 9567
nvme-fabrics: decode host pathing error for connect
MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER
dm verity: fix require_signatures module_param permissions
bnx2x: Fix missing error code in bnx2x_iov_init_one()
nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
nvmet: fix false keep-alive timeout when a controller is torn down
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
spi: Don't have controller clean up spi device before driver unbind
spi: Cleanup on failure of initial setup
i2c: mpc: Make use of i2c_recover_bus()
i2c: mpc: implement erratum A-004447 workaround
ALSA: seq: Fix race of snd_seq_timer_open()
ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun()
ALSA: hda/realtek: headphone and mic don't work on an Acer laptop
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP Elite Dragonfly G2
ALSA: hda/realtek: fix mute/micmute LEDs and speaker for HP EliteBook x360 1040 G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 840 Aero G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP ZBook Power G8
spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
Revert "ACPI: sleep: Put the FACS table after using it"
drm: Fix use-after-free read in drm_getunique()
drm: Lock pointer access in drm_master_release()
perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
KVM: X86: MMU: Use the correct inherited permissions to get shadow page
kvm: avoid speculation-based attacks from out-of-range memslot accesses
staging: rtl8723bs: Fix uninitialized variables
async_xor: check src_offs is not NULL before updating it
btrfs: return value from btrfs_mark_extent_written() in case of error
btrfs: promote debugging asserts to full-fledged checks in validate_super
cgroup1: don't allow '\n' in renaming
ftrace: Do not blindly read the ip address in ftrace_bug()
mmc: renesas_sdhi: abort tuning when timeout detected
mmc: renesas_sdhi: Fix HS400 on R-Car M3-W+
USB: f_ncm: ncm_bitrate (speed) is unsigned
usb: f_ncm: only first packet of aggregate needs to start timer
usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms
usb: dwc3-meson-g12a: fix usb2 PHY glue init when phy0 is disabled
usb: dwc3: meson-g12a: Disable the regulator in the error handling path of the probe
usb: dwc3: gadget: Bail from dwc3_gadget_exit() if dwc->gadget is NULL
usb: dwc3: ep0: fix NULL pointer exception
usb: musb: fix MUSB_QUIRK_B_DISCONNECT_99 handling
usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe()
usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource()
usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind
USB: serial: ftdi_sio: add NovaTech OrionMX product ID
USB: serial: omninet: add device id for Zyxel Omni 56K Plus
USB: serial: quatech2: fix control-request directions
USB: serial: cp210x: fix alternate function for CP2102N QFN20
usb: gadget: eem: fix wrong eem header operation
usb: fix various gadgets null ptr deref on 10gbps cabling.
usb: fix various gadget panics on 10gbps cabling
usb: typec: tcpm: cancel vdm and state machine hrtimer when unregister tcpm port
usb: typec: tcpm: cancel frs hrtimer when unregister tcpm port
regulator: core: resolve supply for boot-on/always-on regulators
regulator: max77620: Use device_set_of_node_from_dev()
regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837
regulator: fan53880: Fix missing n_voltages setting
regulator: bd71828: Fix .n_voltages settings
regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks
phy: usb: Fix misuse of IS_ENABLED
usb: dwc3: gadget: Disable gadget IRQ during pullup disable
usb: typec: mux: Fix copy-paste mistake in typec_mux_match
drm/mcde: Fix off by 10^3 in calculation
drm/msm/a6xx: fix incorrectly set uavflagprd_inv field for A650
drm/msm/a6xx: update/fix CP_PROTECT initialization
drm/msm/a6xx: avoid shadow NULL reference in failure path
RDMA/ipoib: Fix warning caused by destroying non-initial netns
RDMA/mlx4: Do not map the core_clock page to user space unless enabled
ARM: cpuidle: Avoid orphan section warning
vmlinux.lds.h: Avoid orphan section with !SMP
tools/bootconfig: Fix error return code in apply_xbc()
phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe()
ASoC: core: Fix Null-point-dereference in fmt_single_name()
ASoC: meson: gx-card: fix sound-dai dt schema
phy: ti: Fix an error code in wiz_probe()
gpio: wcd934x: Fix shift-out-of-bounds error
perf: Fix data race between pin_count increment/decrement
sched/fair: Keep load_avg and load_sum synced
sched/fair: Make sure to update tg contrib for blocked load
sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling
x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
IB/mlx5: Fix initializing CQ fragments buffer
NFS: Fix a potential NULL dereference in nfs_get_client()
NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
perf session: Correct buffer copying when peeking events
kvm: fix previous commit for 32-bit builds
NFS: Fix use-after-free in nfs4_init_client()
NFSv4: Fix second deadlock in nfs4_evict_inode()
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
scsi: core: Fix error handling of scsi_host_alloc()
scsi: core: Fix failure handling of scsi_add_host_with_dma()
scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
scsi: core: Only put parent device if host state differs from SHOST_CREATED
tracing: Correct the length check which causes memory corruption
proc: only require mm_struct for writing
Linux 5.10.44
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic64172b4e72ccb54d96000b3065dd8b33aa9fef5
In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help
prioritize important work. Given that the CPU shares of the root cgroup
can't be changed, leaving tasks inside the root cgroup gives them a
higher share than tasks inside the important cgroups. This is
mitigated by moving all tasks inside the root cgroup to a different
cgroup after Android has booted. However, many kernel tasks remain
stuck in the root cgroup after boot.
It is possible to relax the migration restrictions on kernel threads
and kworkers under certain scenarios. However, the patch [2] posted
upstream was not accepted. Hence, add a restricted vendor hook to notify
modules when a kernel thread is requested for cgroup migration. Modules
can then relax the restrictions enforced by the kernel and allow the
cgroup migration.
[1] f08f049de1
[2] https://lore.kernel.org/lkml/1617714261-18111-1-git-send-email-pkondeti@codeaurora.org
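The hook idea can be sketched in user space as a callback that is allowed to override a default decision. Everything here is hypothetical (toy_task, toy_register_migrate_hook(), toy_can_migrate()), and the default policy of refusing every kernel thread is a simplification assumed for the sketch; the real hook is implemented with the Android vendor hook infrastructure, not a plain function pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task descriptor; only the field the sketch needs. */
struct toy_task {
	bool is_kthread;
};

/* Hook signature: the module may flip *allow for the given task. */
typedef void (*toy_migrate_hook_t)(const struct toy_task *task, bool *allow);

/* At most one registered hook; NULL means no module is attached. */
static toy_migrate_hook_t toy_migrate_hook;

static void toy_register_migrate_hook(toy_migrate_hook_t hook)
{
	toy_migrate_hook = hook;
}

/*
 * Core decision: user tasks may always migrate. Kernel threads are
 * refused by default, but a registered hook is notified and may relax
 * that restriction.
 */
static bool toy_can_migrate(const struct toy_task *task)
{
	bool allow = !task->is_kthread;

	if (task->is_kthread && toy_migrate_hook)
		toy_migrate_hook(task, &allow);
	return allow;
}

/* Example module policy: permit migration of every kernel thread. */
static void toy_vendor_allow_all(const struct toy_task *task, bool *allow)
{
	(void)task;
	*allow = true;
}
```

Keeping the default inside toy_can_migrate() means behaviour is unchanged when no module registers a hook, which matches the intent of leaving the core policy alone.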
Bug: 184594949
Change-Id: I445a170ba797c8bece3b4b59b7a42cdd85438f1f
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Add a vendor hook after attaching a task to a cgroup so that modules
can recognize the group_id for performance tuning.
Bug: 181917687
Signed-off-by: Frankie Chang <frankie.chang@mediatek.com>
Change-Id: I603afa3d893dd575a7dcb97f83bd9eacb8315bab
(cherry picked from commit de089a37a3d248608a1d5855a4ae82ebad3ec2ab)
Changes in 5.10.17
objtool: Fix seg fault with Clang non-section symbols
Revert "dts: phy: add GPIO number and active state used for phy reset"
gpio: mxs: GPIO_MXS should not default to y unconditionally
gpio: ep93xx: fix BUG_ON port F usage
gpio: ep93xx: Fix single irqchip with multi gpiochips
tracing: Do not count ftrace events in top level enable output
tracing: Check length before giving out the filter buffer
drm/i915: Fix overlay frontbuffer tracking
arm/xen: Don't probe xenbus as part of an early initcall
cgroup: fix psi monitor for root cgroup
Revert "drm/amd/display: Update NV1x SR latency values"
drm/i915/tgl+: Make sure TypeC FIA is powered up when initializing it
drm/dp_mst: Don't report ports connected if nothing is attached to them
dmaengine: move channel device_node deletion to driver
tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
soc: ti: omap-prm: Fix boot time errors for rst_map_012 bits 0 and 1
arm64: dts: rockchip: Fix PCIe DT properties on rk3399
arm64: dts: qcom: sdm845: Reserve LPASS clocks in gcc
ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled
arm64: dts: rockchip: remove interrupt-names property from rk3399 vdec node
platform/x86: hp-wmi: Disable tablet-mode reporting by default
arm64: dts: rockchip: Disable display for NanoPi R2S
ovl: perform vfs_getxattr() with mounter creds
cap: fix conversions on getxattr
ovl: skip getxattr of security labels
scsi: lpfc: Fix EEH encountering oops with NVMe traffic
x86/split_lock: Enable the split lock feature on another Alder Lake CPU
nvme-pci: ignore the subsysem NQN on Phison E16
drm/amd/display: Fix DPCD translation for LTTPR AUX_RD_INTERVAL
drm/amd/display: Add more Clock Sources to DCN2.1
drm/amd/display: Release DSC before acquiring
drm/amd/display: Fix dc_sink kref count in emulated_link_detect
drm/amd/display: Free atomic state after drm_atomic_commit
drm/amd/display: Decrement refcount of dc_sink before reassignment
riscv: virt_addr_valid must check the address belongs to linear mapping
bfq-iosched: Revert "bfq: Fix computation of shallow depth"
ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL
kallsyms: fix nonconverging kallsyms table with lld
ARM: ensure the signal page contains defined contents
ARM: kexec: fix oops after TLB are invalidated
ubsan: implement __ubsan_handle_alignment_assumption
Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs"
x86/efi: Remove EFI PGD build time checks
lkdtm: don't move ctors to .rodata
KVM: x86: cleanup CR3 reserved bits checks
cgroup-v1: add disabled controller check in cgroup1_parse_param()
dmaengine: idxd: fix misc interrupt completion
ath9k: fix build error with LEDS_CLASS=m
mt76: dma: fix a possible memory leak in mt76_add_fragment()
drm/vc4: hvs: Fix buffer overflow with the dlist handling
dmaengine: idxd: check device state before issue command
bpf: Unbreak BPF_PROG_TYPE_KPROBE when kprobe is called via do_int3
bpf: Check for integer overflow when using roundup_pow_of_two()
netfilter: xt_recent: Fix attempt to update deleted entry
selftests: netfilter: fix current year
netfilter: nftables: fix possible UAF over chains from packet path in netns
netfilter: flowtable: fix tcp and udp header checksum update
xen/netback: avoid race in xenvif_rx_ring_slots_available()
net: hdlc_x25: Return meaningful error code in x25_open
net: ipa: set error code in gsi_channel_setup()
hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive()
net: enetc: initialize the RFS and RSS memories
selftests: txtimestamp: fix compilation issue
net: stmmac: set TxQ mode back to DCB after disabling CBS
ibmvnic: Clear failover_pending if unable to schedule
netfilter: conntrack: skip identical origin tuple in same zone only
scsi: scsi_debug: Fix a memory leak
x86/build: Disable CET instrumentation in the kernel for 32-bit too
net: dsa: felix: implement port flushing on .phylink_mac_link_down
net: hns3: add a check for queue_id in hclge_reset_vf_queue()
net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
net: hns3: add a check for index in hclge_get_rss_key()
firmware_loader: align .builtin_fw to 8
drm/sun4i: tcon: set sync polarity for tcon1 channel
drm/sun4i: dw-hdmi: always set clock rate
drm/sun4i: Fix H6 HDMI PHY configuration
drm/sun4i: dw-hdmi: Fix max. frequency for H6
clk: sunxi-ng: mp: fix parent rate change flag check
i2c: stm32f7: fix configuration of the digital filter
h8300: fix PREEMPTION build, TI_PRE_COUNT undefined
scripts: set proper OpenSSL include dir also for sign-file
x86/pci: Create PCI/MSI irqdomain after x86_init.pci.arch_init()
arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page
rxrpc: Fix clearance of Tx/Rx ring when releasing a call
udp: fix skb_copy_and_csum_datagram with odd segment sizes
net: dsa: call teardown method on probe failure
cpufreq: ACPI: Extend frequency tables to cover boost frequencies
cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
net: gro: do not keep too many GRO packets in napi->rx_list
net: fix iteration for sctp transport seq_files
net/vmw_vsock: fix NULL pointer dereference
net/vmw_vsock: improve locking in vsock_connect_timeout()
net: watchdog: hold device global xmit lock during tx disable
bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
vsock/virtio: update credit only if socket is not closed
vsock: fix locking in vsock_shutdown()
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net/qrtr: restrict user-controlled length in qrtr_tun_write_iter()
ovl: expand warning in ovl_d_real()
kcov, usb: only collect coverage from __usb_hcd_giveback_urb in softirq
Linux 5.10.17
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id0300681f52b51d3f466f1e66ec3a6c25f65f4d3
[ Upstream commit 61e960b07b637f0295308ad91268501d744c21b5 ]
When mounting a cgroup hierarchy with disabled controller in cgroup v1,
all available controllers will be attached.
For example, boot with cgroup_no_v1=cpu or cgroup_disable=cpu, and then
mount with "mount -t cgroup -ocpu cpu /sys/fs/cgroup/cpu", then all
enabled controllers will be attached except cpu.
Fix this by adding disabled controller check in cgroup1_parse_param().
If the specified controller is disabled, just return error with information
"Disabled controller xx" rather than attaching all the other enabled
controllers.
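
The check described above can be sketched in user space as follows. This is a minimal model, not the kernel code: struct controller_model, parse_controller_option() and the -EINVAL return value are illustrative assumptions; only the "Disabled controller" message comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's per-controller state. */
struct controller_model {
    const char *name;
    bool disabled;      /* e.g. set by cgroup_no_v1= or cgroup_disable= */
};

/*
 * Resolve one mount-option token to a controller index.
 * Returns the index on success. Returns -EINVAL (an assumed error code)
 * and fills msg when the controller is unknown or disabled, so the
 * caller never falls through to "attach everything that is enabled".
 */
int parse_controller_option(const struct controller_model *tbl, size_t n,
                            const char *token, char *msg, size_t msglen)
{
    assert(tbl != NULL && token != NULL && msg != NULL && msglen > 0);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(tbl[i].name, token) != 0)
            continue;
        if (tbl[i].disabled) {
            snprintf(msg, msglen, "Disabled controller '%s'", token);
            return -EINVAL;
        }
        msg[0] = '\0';
        return (int)i;
    }
    snprintf(msg, msglen, "Unknown controller '%s'", token);
    return -EINVAL;
}
```

With a table in which cpu is marked disabled, asking for "cpu" fails with the message instead of silently selecting the remaining controllers.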
Fixes: f5dfb5315d ("cgroup: take options parsing into ->parse_monolithic()")
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Reviewed-by: Zefan Li <lizefan.x@bytedance.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.5
net/sched: sch_taprio: reset child qdiscs before freeing them
mptcp: fix security context on server socket
ethtool: fix error paths in ethnl_set_channels()
ethtool: fix string set id check
md/raid10: initialize r10_bio->read_slot before use.
drm/amd/display: Add get_dig_frontend implementation for DCEx
io_uring: close a small race gap for files cancel
jffs2: Allow setting rp_size to zero during remounting
jffs2: Fix NULL pointer dereference in rp_size fs option parsing
spi: dw-bt1: Fix undefined devm_mux_control_get symbol
opp: fix memory leak in _allocate_opp_table
opp: Call the missing clk_put() on error
scsi: block: Fix a race in the runtime power management code
mm/hugetlb: fix deadlock in hugetlb_cow error path
mm: memmap defer init doesn't work as expected
lib/zlib: fix inflating zlib streams on s390
io_uring: don't assume mm is constant across submits
io_uring: use bottom half safe lock for fixed file data
io_uring: add a helper for setting a ref node
io_uring: fix io_sqe_files_unregister() hangs
uapi: move constants from <linux/kernel.h> to <linux/const.h>
tools headers UAPI: Sync linux/const.h with the kernel headers
cgroup: Fix memory leak when parsing multiple source parameters
zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
scsi: cxgb4i: Fix TLS dependency
Bluetooth: hci_h5: close serdev device and free hu in h5_close
fbcon: Disable accelerated scrolling
reiserfs: add check for an invalid ih_entry_count
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
media: gp8psk: initialize stats at power control logic
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
ALSA: seq: Use bool for snd_seq_queue internal flags
ALSA: rawmidi: Access runtime->avail always in spinlock
bfs: don't use WARNING: string when it's just info.
ext4: check for invalid block size early when mounting a file system
fcntl: Fix potential deadlock in send_sig{io, urg}()
io_uring: check kthread stopped flag when sq thread is unparked
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
module: set MODULE_STATE_GOING state when a module fails to load
quota: Don't overflow quota file offsets
rtc: pl031: fix resource leak in pl031_probe
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
i3c master: fix missing destroy_workqueue() on error in i3c_master_register
NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
f2fs: avoid race condition for shrinker count
f2fs: fix race of pending_pages in decompression
module: delay kobject uevent until after module init call
powerpc/64: irq replay remove decrementer overflow check
fs/namespace.c: WARN if mnt_count has become negative
watchdog: rti-wdt: fix reference leak in rti_wdt_probe
um: random: Register random as hwrng-core device
um: ubd: Submit all data segments atomically
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails
drm/amd/display: updated wm table for Renoir
tick/sched: Remove bogus boot "safety" check
s390: always clear kernel stack backchain before calling functions
io_uring: remove racy overflow list fast checks
ALSA: pcm: Clear the full allocated memory at hw_params
dm verity: skip verity work if I/O error when system is shutting down
ext4: avoid s_mb_prefetch to be zero in individual scenarios
device-dax: Fix range release
Linux 5.10.5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2b481bfac06bafdef2cf3cc1ac2c2a4ddf9913dc
cgrp->root->release_agent_path is protected by both cgroup_mutex and
release_agent_path_lock and readers can hold either one. The
dual-locking scheme was introduced while breaking a locking dependency
issue around cgroup_mutex but doesn't make sense anymore given that
the only remaining reader which uses cgroup_mutex is
cgroup1_release_agent().
This patch updates cgroup1_release_agent() to use
release_agent_path_lock so that release_agent_path is always protected
only by release_agent_path_lock.
While at it, convert strlen() based empty string checks to direct
tests on the first character as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Older (and maybe current) versions of systemd set release_agent to "" when
shutting down, but do not set notify_on_release to 0.
Since 64e90a8acb ("Introduce STATIC_USERMODEHELPER to mediate
call_usermodehelper()"), we filter out such calls when the user mode helper
path is "". However, when used in conjunction with an actual (i.e. non "")
STATIC_USERMODEHELPER, the path is never "", so the real usermode helper
will be called with argv[0] == "".
Let's avoid this by not invoking the release_agent when it is "".
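
A minimal sketch of the resulting decision, assuming a hypothetical helper should_invoke_release_agent() (the kernel performs this test inline, and the names here are illustrative). It uses a first-character test for the empty string rather than strlen():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether the release agent should be run for an emptied cgroup.
 * agent_path models release_agent_path; an empty string means "unset".
 */
bool should_invoke_release_agent(bool notify_on_release,
                                 const char *agent_path)
{
    assert(agent_path != NULL);

    if (!notify_on_release)
        return false;
    /* Direct test on the first character instead of strlen() == 0. */
    if (agent_path[0] == '\0')
        return false;
    return true;
}
```

This covers the systemd case: notify_on_release is still 1, but the empty path means no helper is spawned.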
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Tejun Heo <tj@kernel.org>
If the seq_file .next function does not change the position index,
a read after some lseek can generate unexpected output.
# mount | grep cgroup
# dd if=/mnt/cgroup.procs bs=1 # normal output
...
1294
1295
1296
1304
1382
584+0 records in
584+0 records out
584 bytes copied
dd: /mnt/cgroup.procs: cannot skip to specified offset
83 <<< generates end of last line
1383 <<< ... and whole last line once again
0+1 records in
0+1 records out
8 bytes copied
dd: /mnt/cgroup.procs: cannot skip to specified offset
1386 <<< generates last line anyway
0+1 records in
0+1 records out
5 bytes copied
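
The rule behind the fix can be modelled in user space as below. This is a sketch under stated assumptions: struct pid_list_model and pidlist_next_model() are hypothetical names, and the position is treated as a plain array index, which is simpler than the real pidlist code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a sorted pid list exposed through seq_file. */
struct pid_list_model {
    const int *pids;
    size_t count;
};

/*
 * Model of a .next callback: the position index is advanced on every
 * call, including the call that runs off the end and returns NULL.
 * Leaving *pos untouched at the end is what lets a later read at that
 * offset emit the last entry again.
 */
const int *pidlist_next_model(const struct pid_list_model *l, long long *pos)
{
    assert(l != NULL && pos != NULL);

    ++*pos;
    if (*pos < 0 || (size_t)*pos >= l->count)
        return NULL;
    return &l->pids[*pos];
}
```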
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tiny steps to deal with merge issues in sdcardfs due to fs param passing
api changes.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I03ba8763e8cc324c25fb6316c363b59957103474
Android expects system_server to be able to move tasks between different
cgroups/cpusets, but does not want to be running as root. Let's relax
permission check so that processes can move other tasks if they have
CAP_SYS_NICE in the affected task's user namespace.
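
A rough user-space model of the relaxed check follows. The struct names and the has_cap_sys_nice_in_target_ns flag are illustrative assumptions standing in for a capability lookup in the target task's user namespace; the root and uid/suid comparisons are assumed to be the checks that were already in place.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credentials of the process writing to the cgroup file. */
struct writer_cred_model {
    unsigned int euid;
    bool has_cap_sys_nice_in_target_ns;
};

/* Hypothetical credentials of the task being moved. */
struct target_cred_model {
    unsigned int uid;
    unsigned int suid;
};

/* Returns 0 if the move is allowed, -EACCES otherwise. */
int check_attach_permission(const struct writer_cred_model *w,
                            const struct target_cred_model *t)
{
    assert(w != NULL && t != NULL);

    if (w->euid == 0)
        return 0;                       /* root may always move tasks */
    if (w->euid == t->uid || w->euid == t->suid)
        return 0;                       /* owner of the target task */
    if (w->has_cap_sys_nice_in_target_ns)
        return 0;                       /* the relaxation described above */
    return -EACCES;
}
```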
BUG=b:31790445,chromium:647994
Bug: 147109865
TEST=Boot android container, examine logcat
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394927
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
[AmitP: Refactored original changes to align with upstream commit
201af4c0fa ("cgroup: move cgroup files under kernel/cgroup/")]
Change-Id: Ia919c66ab6ed6a6daf7c4cf67feb38b13b1ad09b
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
(cherry picked from commit ec54762b84a1d06de188bc846655305d3f7acf75)
There are reports of users who use thread migrations between cgroups and
they report performance drop after d59cfc09c3 ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with more CPUs.
The migration is affected by forking noise happening in the background:
after the mentioned commit, a migrating thread must wait for all
(forking) processes on the system, not only those of its threadgroup.
There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).
In the case of self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in case of single thread
migration and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In case of do_exit because of signal
delivery, we either exit before the migration or finish the migration
(of not yet PF_EXITING thread) and die afterwards.
This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.
This change improves migration dependent workload performance similar
to per-signal_struct state.
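
The locking decision can be sketched as a small user-space parser. This is a model under assumptions: parse_migration_write() and its parameters are hypothetical, and the threadgroup flag is assumed to distinguish whole-process writes from single-thread writes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse the value written to the cgroup file and decide whether the
 * global rwsem would be taken. Only a single-thread write of "0"
 * (self-migration) skips it; any numeric PID takes it, even the
 * caller's own.
 */
int parse_migration_write(const char *buf, bool threadgroup,
                          long *pid, bool *need_rwsem)
{
    char *end;
    long v;

    assert(buf != NULL && pid != NULL && need_rwsem != NULL);

    errno = 0;
    v = strtol(buf, &end, 10);
    if (errno != 0 || end == buf || v < 0)
        return -EINVAL;
    while (*end == '\n' || *end == ' ')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *pid = v;
    *need_rwsem = (v != 0) || threadgroup;
    return 0;
}
```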
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of using its own logic for k-/vmalloc, rely on
kvmalloc, which does essentially the same thing.
Signed-off-by: Marc Koderer <marc@koderer.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.
Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass it fs_context instead of fs_type/flags/root triple, have
it return int instead of dentry and make it deal with setting
fc->root.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that this reference is *NOT* contributing to refcount of
cgroup_root in question and is valid only until cgroup_do_mount()
returns.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Store the results in cgroup_fs_context. There's a nasty twist caused
by the enabling/disabling subsystems - we can't do the checks sensitive
to that until cgroup_mutex gets grabbed. Frankly, these checks are
complete bullshit (e.g. all,none combination is accepted if all subsystems
are disabled; so's cpusets,none and all,cpusets when cpusets is disabled,
etc.), but touching that would be a userland-visible behaviour change ;-/
So we do parsing in ->parse_monolithic() and have the consistency checks
done in check_cgroupfs_options(), with the latter called (on already parsed
options) from cgroup1_get_tree() and cgroup1_reconfigure().
Freeing the strdup'ed strings is done from fs_context destructor, which
somewhat simplifies the life for cgroup1_{get_tree,reconfigure}().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unfortunately, cgroup is tangled into kernfs infrastructure.
To avoid converting all kernfs-based filesystems at once,
we need to untangle the remount part of things, instead of
having it go through kernfs_sop_remount_fs(). Fortunately,
it's not hard to do.
This commit just gets cgroup/cgroup1 to use fs_context to
deliver options on mount and remount paths. Parsing those
is going to be done in the next commits; for now we do
pretty much what legacy case does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make the reference from superblock to cgroup_root counting -
do cgroup_put() in cgroup_kill_sb() whether we'd done
percpu_ref_kill() or not; matching grab is done when we allocate
a new root. That gives the same refcounting rules for all callers
of cgroup_do_mount() - a reference to cgroup_root has been grabbed
by caller and it either is transferred to new superblock or dropped.
* have cgroup_kill_sb() treat an already killed refcount as "just
don't bother killing it, then".
* after successful cgroup_do_mount() have cgroup1_mount() recheck
if we'd raced with mount/umount from somebody else and cgroup_root
got killed. In that case we drop the superblock and bugger off
with -ERESTARTSYS, same as if we'd found it in the list already
dying.
* don't bother with delayed initialization of refcount - it's
unreliable and not needed. No need to prevent attempts to bump
the refcount if we find cgroup_root of another mount in progress -
sget will reuse an existing superblock just fine and if the
other sb manages to die before we get there, we'll catch
that immediately after cgroup_do_mount().
* don't bother with kernfs_pin_sb() - no need for doing that
either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It can be useful to inhibit all cgroup1 hierarchies especially during
transition and for debugging. cgroup_no_v1 can block hierarchies with
controllers which leaves out the named hierarchies. Expand it to
cover the named hierarchies so that "cgroup_no_v1=all,named" disables
all cgroup1 hierarchies.
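
A minimal sketch of parsing such a parameter value, assuming a hypothetical three-entry controller table and a hypothetical struct no_v1_cfg (the kernel uses its own subsystem list and mask):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical controller names; the real list is much longer. */
static const char *const known_controllers[] = { "cpu", "memory", "pids" };
#define NR_KNOWN (sizeof(known_controllers) / sizeof(known_controllers[0]))

struct no_v1_cfg {
    unsigned int mask;  /* bit i set: controller i is blocked on v1 */
    bool named;         /* named hierarchies are blocked as well */
};

/* Parse a comma-separated value such as "all,named" or "cpu,pids". */
void parse_cgroup_no_v1(const char *arg, struct no_v1_cfg *cfg)
{
    assert(arg != NULL && cfg != NULL);

    cfg->mask = 0;
    cfg->named = false;
    while (*arg != '\0') {
        size_t len = strcspn(arg, ",");

        if (len == 3 && strncmp(arg, "all", len) == 0) {
            cfg->mask = (1u << NR_KNOWN) - 1;
        } else if (len == 5 && strncmp(arg, "named", len) == 0) {
            cfg->named = true;
        } else {
            for (size_t i = 0; i < NR_KNOWN; i++) {
                if (strlen(known_controllers[i]) == len &&
                    strncmp(arg, known_controllers[i], len) == 0)
                    cfg->mask |= 1u << i;
            }
        }
        arg += len;
        if (*arg == ',')
            arg++;
    }
}
```

In this model, "all" alone leaves named hierarchies usable; only "all,named" blocks everything.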
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Marcin Pawlowski <mpawlowski@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is unwise to take spin locks from the handlers of trace events,
mainly because doing so can introduce lockups: it adds locks in places
that are normally not tested. Worse yet, because trace events are
tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that cgroup_path() only needs to be called once (it
is currently called twice: once to get the length to reserve the
buffer for, and once again to get the path itself).
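
The shape of such a macro can be sketched in user space as follows. All names here are hypothetical models (trace_enabled_model stands in for trace_cgroup_##type##_enabled(), cgroup_path_model() for cgroup_path()), and a local buffer is assumed where the kernel may use a different storage scheme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

bool trace_enabled_model;       /* models the "event enabled" test */
int path_calls;                 /* counts how often the path is computed */
char last_traced_path[128];     /* what the "handler" received */

/* Models cgroup_path(): the expensive, lock-taking path lookup. */
void cgroup_path_model(const char *name, char *buf, size_t len)
{
    assert(name != NULL && buf != NULL && len > 0);
    path_calls++;
    snprintf(buf, len, "/%s", name);
}

/* Models the trace event handler: it only copies a ready-made string. */
void trace_event_model(const char *path)
{
    snprintf(last_traced_path, sizeof(last_traced_path), "%s", path);
}

/* Compute the path once, outside the handler, and only when enabled. */
#define TRACE_CGROUP_PATH_MODEL(name)                                   \
    do {                                                                \
        if (trace_enabled_model) {                                      \
            char path_buf_[128];                                        \
            cgroup_path_model((name), path_buf_, sizeof(path_buf_));    \
            trace_event_model(path_buf_);                               \
        }                                                               \
    } while (0)
```

When the event is disabled the path lookup is skipped entirely, and when it is enabled the lookup runs exactly once per event.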
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduce the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.
kworker/0:0
cpuset_hotplug_workfn()
  cpuset_hotplug_update_tasks()
    hotplug_update_tasks_legacy()
      remove_tasks_in_empty_cpuset()
        cgroup_transfer_tasks() // stuck in iterator loop
          cgroup_migrate()
            cgroup_migrate_add_task()
In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.
Task T
do_exit()
  exit_signals() // sets PF_EXITING
  exit_task_namespaces()
    switch_task_namespaces()
      free_nsproxy()
        put_mnt_ns()
          drop_collected_mounts()
            namespace_unlock()
              synchronize_rcu()
                _synchronize_rcu_expedited()
                  schedule_work() // on cpu0 low priority worker pool
                  wait_event() // waiting for work item to execute
Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
completes execution of cpuset_hotplug_workfn().
kworker/0:0 ==> Task T ==> kworker/0:0
In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.
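
The fix amounts to skipping exiting tasks while iterating. A minimal user-space model, with hypothetical names (struct exiting_task_model, next_migratable_task()) and an array cursor in place of the css_task_iter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model; exiting stands in for PF_EXITING. */
struct exiting_task_model {
    int pid;
    bool exiting;
};

/*
 * Return the next task that can actually be migrated, advancing the
 * cursor past exiting tasks, or NULL when the source cgroup has no
 * further candidates. The cursor always moves forward, so an exiting
 * task can no longer pin the loop.
 */
const struct exiting_task_model *
next_migratable_task(const struct exiting_task_model *tasks, size_t n,
                     size_t *cursor)
{
    assert(cursor != NULL && (tasks != NULL || n == 0));

    while (*cursor < n) {
        const struct exiting_task_model *t = &tasks[(*cursor)++];

        if (!t->exiting)
            return t;
    }
    return NULL;
}
```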
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs
filesystem to enable cpuset controller to use v2 behavior in a v1
cgroup. This mount option applies only to the cpuset controller and has
no effect on other controllers.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
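
The "nearest ancestor which isn't threaded" rule can be sketched as a parent walk. struct cgroup_model and threaded_domain_of() are hypothetical names for illustration; the kernel tracks this relationship with its own fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cgroup node with just enough state for the walk. */
struct cgroup_model {
    struct cgroup_model *parent;    /* NULL for the root */
    bool threaded;                  /* made threaded via "cgroup.type" */
};

/*
 * Return the threaded domain of cgrp: the closest ancestor that is not
 * threaded. A cgroup that is not threaded is its own domain.
 */
struct cgroup_model *threaded_domain_of(struct cgroup_model *cgrp)
{
    assert(cgrp != NULL);

    while (cgrp->threaded && cgrp->parent != NULL)
        cgrp = cgrp->parent;
    return cgrp;
}
```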
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
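
A small model of a flag-controlled iterator is sketched below. ITER_PROCS_MODEL stands in for CSS_TASK_ITER_PROCS, and the structs and functions are hypothetical; a task is treated as a group leader when its pid equals its tgid.

```c
#include <assert.h>
#include <stddef.h>

#define ITER_PROCS_MODEL (1u << 0)  /* walk group leaders only */

struct iter_task_model {
    int pid;
    int tgid;
};

struct task_iter_model {
    const struct iter_task_model *tasks;
    size_t count;
    size_t pos;
    unsigned int flags;
};

void task_iter_start_model(struct task_iter_model *it,
                           const struct iter_task_model *tasks,
                           size_t count, unsigned int flags)
{
    assert(it != NULL && (tasks != NULL || count == 0));
    it->tasks = tasks;
    it->count = count;
    it->pos = 0;
    it->flags = flags;
}

/* Return the next task honouring the flags, or NULL when done. */
const struct iter_task_model *task_iter_next_model(struct task_iter_model *it)
{
    assert(it != NULL);

    while (it->pos < it->count) {
        const struct iter_task_model *t = &it->tasks[it->pos++];

        /* With the flag set, non-leader threads are skipped here,
         * so callers no longer filter them out themselves. */
        if ((it->flags & ITER_PROCS_MODEL) && t->pid != t->tgid)
            continue;
        return t;
    }
    return NULL;
}
```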
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2. This patch
reorganizes the write path so that there are common helper functions
that different write paths use.
While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
v3: - Restructured so that cgroup_procs_write_permission() takes
@src_cgrp and @dst_cgrp.
v2: - Rolled in Waiman's task reference count fix.
- Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
The debug cgroup currently resides within cgroup-v1.c and is enabled
only for v1 cgroup. To enable the debug cgroup also for v2, it makes
sense to put the code into its own file as it will no longer be v1
specific. There is no change to the debug cgroup specific code.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.
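
Why the two counters diverge can be shown with a toy model. struct css_set_model and its helpers are hypothetical, and the assumption that each attached task also holds one reference is made only for illustration; in the kernel the counter is updated under css_set_lock, which this single-threaded sketch omits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical css_set with a reference count and a task count. */
struct css_set_model {
    int refcount;   /* every holder of a pointer, tasks or not */
    int nr_tasks;   /* tasks actually attached */
};

void cset_init_model(struct css_set_model *s)
{
    assert(s != NULL);
    s->refcount = 1;    /* creator's reference */
    s->nr_tasks = 0;
}

/* A non-task user (another set, an iterator, ...) takes a reference. */
void cset_get_model(struct css_set_model *s)
{
    assert(s != NULL);
    s->refcount++;
}

/* Attaching a task takes a reference and bumps the task count. */
void cset_attach_task_model(struct css_set_model *s)
{
    assert(s != NULL);
    s->refcount++;
    s->nr_tasks++;
}

void cset_detach_task_model(struct css_set_model *s)
{
    assert(s != NULL && s->nr_tasks > 0);
    s->nr_tasks--;
    s->refcount--;
}
```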
tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
Refreshed on top of cgroup/for-v4.13 which dropped on
css_set_populated() -> nr_tasks conversion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t