Changes in 5.10.38
KEYS: trusted: Fix memory leak on object td
tpm: fix error return code in tpm2_get_cc_attrs_tbl()
tpm, tpm_tis: Extend locality handling to TPM2 in tpm_tis_gen_interrupt()
tpm, tpm_tis: Reserve locality in tpm_tis_resume()
KVM: x86/mmu: Remove the defunct update_pte() paging hook
KVM/VMX: Invoke NMI non-IST entry instead of IST entry
ACPI: PM: Add ACPI ID of Alder Lake Fan
PM: runtime: Fix unpaired parent child_count for force_resume
cpufreq: intel_pstate: Use HWP if enabled by platform firmware
kvm: Cap halt polling at kvm->max_halt_poll_ns
ath11k: fix thermal temperature read
fs: dlm: fix debugfs dump
fs: dlm: add errno handling to check callback
fs: dlm: check on minimum msglen size
fs: dlm: flush swork on shutdown
tipc: convert dest node's address to network order
ASoC: Intel: bytcr_rt5640: Enable jack-detect support on Asus T100TAF
net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath
net: stmmac: Set FIFO sizes for ipq806x
ASoC: rsnd: core: Check convert rate in rsnd_hw_params
Bluetooth: Fix incorrect status handling in LE PHY UPDATE event
i2c: bail out early when RDWR parameters are wrong
ALSA: hdsp: don't disable if not enabled
ALSA: hdspm: don't disable if not enabled
ALSA: rme9652: don't disable if not enabled
ALSA: bebob: enable to deliver MIDI messages for multiple ports
Bluetooth: Set CONF_NOT_COMPLETE as l2cap_chan default
Bluetooth: initialize skb_queue_head at l2cap_chan_create()
net/sched: cls_flower: use ntohs for struct flow_dissector_key_ports
net: bridge: when suppression is enabled exclude RARP packets
Bluetooth: check for zapped sk before connecting
selftests/powerpc: Fix L1D flushing tests for Power10
powerpc/32: Statically initialise first emergency context
net: hns3: remediate a potential overflow risk of bd_num_list
net: hns3: add handling for xmit skb with recursive fraglist
ip6_vti: proper dev_{hold|put} in ndo_[un]init methods
ASoC: Intel: bytcr_rt5640: Add quirk for the Chuwi Hi8 tablet
ice: handle increasing Tx or Rx ring sizes
Bluetooth: btusb: Enable quirk boolean flag for Mediatek Chip.
ASoC: rt5670: Add a quirk for the Dell Venue 10 Pro 5055
i2c: Add I2C_AQ_NO_REP_START adapter quirk
MIPS: Loongson64: Use _CACHE_UNCACHED instead of _CACHE_UNCACHED_ACCELERATED
coresight: Do not scan for graph if none is present
IB/hfi1: Correct oversized ring allocation
mac80211: clear the beacon's CRC after channel switch
pinctrl: samsung: use 'int' for register masks in Exynos
rtw88: 8822c: add LC calibration for RTL8822C
mt76: mt7615: support loading EEPROM for MT7613BE
mt76: mt76x0: disable GTK offloading
mt76: mt7915: fix txpower init for TSSI off chips
fuse: invalidate attrs when page writeback completes
virtiofs: fix userns
cuse: prevent clone
iwlwifi: pcie: make cfg vs. trans_cfg more robust
powerpc/mm: Add cond_resched() while removing hpte mappings
ASoC: rsnd: call rsnd_ssi_master_clk_start() from rsnd_ssi_init()
Revert "iommu/amd: Fix performance counter initialization"
iommu/amd: Remove performance counter pre-initialization test
drm/amd/display: Force vsync flip when reconfiguring MPCC
selftests: Set CC to clang in lib.mk if LLVM is set
kconfig: nconf: stop endless search loops
ALSA: hda/realtek: Add quirk for Lenovo Ideapad S740
ASoC: Intel: sof_sdw: add quirk for new ADL-P Rvp
ALSA: hda/hdmi: fix race in handling acomp ELD notification at resume
sctp: Fix out-of-bounds warning in sctp_process_asconf_param()
flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target()
powerpc/smp: Set numa node before updating mask
ASoC: rt286: Generalize support for ALC3263 codec
ethtool: ioctl: Fix out-of-bounds warning in store_link_ksettings_for_user()
net: sched: tapr: prevent cycle_time == 0 in parse_taprio_schedule
samples/bpf: Fix broken tracex1 due to kprobe argument change
powerpc/pseries: Stop calling printk in rtas_stop_self()
drm/amd/display: fixed divide by zero kernel crash during dsc enablement
drm/amd/display: add handling for hdcp2 rx id list validation
drm/amdgpu: Add mem sync flag for IB allocated by SA
mt76: mt7615: fix entering driver-own state on mt7663
crypto: ccp: Free SEV device if SEV init fails
wl3501_cs: Fix out-of-bounds warnings in wl3501_send_pkt
wl3501_cs: Fix out-of-bounds warnings in wl3501_mgmt_join
qtnfmac: Fix possible buffer overflow in qtnf_event_handle_external_auth
powerpc/iommu: Annotate nested lock for lockdep
iavf: remove duplicate free resources calls
net: ethernet: mtk_eth_soc: fix RX VLAN offload
selftests: mlxsw: Increase the tolerance of backlog buildup
selftests: mlxsw: Fix mausezahn invocation in ERSPAN scale test
kbuild: generate Module.symvers only when vmlinux exists
bnxt_en: Add PCI IDs for Hyper-V VF devices.
ia64: module: fix symbolizer crash on fdescr
watchdog: rename __touch_watchdog() to a better descriptive name
watchdog: explicitly update timestamp when reporting softlockup
watchdog/softlockup: remove logic that tried to prevent repeated reports
watchdog: fix barriers when printing backtraces from all CPUs
ASoC: rt286: Make RT286_SET_GPIO_* readable and writable
thermal: thermal_of: Fix error return code of thermal_of_populate_bind_params()
f2fs: move ioctl interface definitions to separated file
f2fs: fix compat F2FS_IOC_{MOVE,GARBAGE_COLLECT}_RANGE
f2fs: fix to allow migrating fully valid segment
f2fs: fix panic during f2fs_resize_fs()
f2fs: fix a redundant call to f2fs_balance_fs if an error occurs
remoteproc: qcom_q6v5_mss: Replace ioremap with memremap
remoteproc: qcom_q6v5_mss: Validate p_filesz in ELF loader
PCI: iproc: Fix return value of iproc_msi_irq_domain_alloc()
PCI: Release OF node in pci_scan_device()'s error path
ARM: 9064/1: hw_breakpoint: Do not directly check the event's overflow_handler hook
f2fs: fix to align to section for fallocate() on pinned file
f2fs: fix to update last i_size if fallocate partially succeeds
PCI: endpoint: Make *_get_first_free_bar() take into account 64 bit BAR
PCI: endpoint: Add helper API to get the 'next' unreserved BAR
PCI: endpoint: Make *_free_bar() to return error codes on failure
PCI: endpoint: Fix NULL pointer dereference for ->get_features()
f2fs: fix to avoid touching checkpointed data in get_victim()
f2fs: fix to cover __allocate_new_section() with curseg_lock
f2fs: Fix a hungtask problem in atomic write
f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
rpmsg: qcom_glink_native: fix error return code of qcom_glink_rx_data()
NFS: nfs4_bitmask_adjust() must not change the server global bitmasks
NFS: Fix attribute bitmask in _nfs42_proc_fallocate()
NFSv4.2: Always flush out writes in nfs42_proc_fallocate()
NFS: Deal correctly with attribute generation counter overflow
PCI: endpoint: Fix missing destroy_workqueue()
pNFS/flexfiles: fix incorrect size check in decode_nfs_fh()
NFSv4.2 fix handling of sr_eof in SEEK's reply
SUNRPC: Move fault injection call sites
SUNRPC: Remove trace_xprt_transmit_queued
SUNRPC: Handle major timeout in xprt_adjust_timeout()
thermal/drivers/tsens: Fix missing put_device error
NFSv4.x: Don't return NFS4ERR_NOMATCHING_LAYOUT if we're unmounting
nfsd: ensure new clients break delegations
rtc: fsl-ftm-alarm: add MODULE_TABLE()
dmaengine: idxd: Fix potential null dereference on pointer status
dmaengine: idxd: fix dma device lifetime
dmaengine: idxd: fix cdev setup and free device lifetime issues
SUNRPC: fix ternary sign expansion bug in tracing
pwm: atmel: Fix duty cycle calculation in .get_state()
xprtrdma: Avoid Receive Queue wrapping
xprtrdma: Fix cwnd update ordering
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
swiotlb: Fix the type of index
ceph: fix inode leak on getattr error in __fh_to_dentry
scsi: qla2xxx: Prevent PRLI in target mode
scsi: ufs: core: Do not put UFS power into LPM if link is broken
scsi: ufs: core: Cancel rpm_dev_flush_recheck_work during system suspend
scsi: ufs: core: Narrow down fast path in system suspend path
rtc: ds1307: Fix wday settings for rx8130
net: hns3: fix incorrect configuration for igu_egu_hw_err
net: hns3: initialize the message content in hclge_get_link_mode()
net: hns3: add check for HNS3_NIC_STATE_INITED in hns3_reset_notify_up_enet()
net: hns3: fix for vxlan gpe tx checksum bug
net: hns3: use netif_tx_disable to stop the transmit queue
net: hns3: disable phy loopback setting in hclge_mac_start_phy
sctp: do asoc update earlier in sctp_sf_do_dupcook_a
RISC-V: Fix error code returned by riscv_hartid_to_cpuid()
sunrpc: Fix misplaced barrier in call_decode
libbpf: Fix signed overflow in ringbuf_process_ring
block/rnbd-clt: Change queue_depth type in rnbd_clt_session to size_t
block/rnbd-clt: Check the return value of the function rtrs_clt_query
ethernet:enic: Fix a use after free bug in enic_hard_start_xmit
sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
netfilter: xt_SECMARK: add new revision to fix structure layout
xsk: Fix for xp_aligned_validate_desc() when len == chunk_size
net: stmmac: Clear receive all(RA) bit when promiscuous mode is off
drm/radeon: Fix off-by-one power_state index heap overwrite
drm/radeon: Avoid power table parsing memory leaks
arm64: entry: factor irq triage logic into macros
arm64: entry: always set GIC_PRIO_PSR_I_SET during entry
khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()
mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
ksm: fix potential missing rmap_item for stable_node
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors
ethtool: fix missing NLM_F_MULTI flag when dumping
net: fix nla_strcmp to handle more then one trailing null character
smc: disallow TCP_ULP in smc_setsockopt()
netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
netfilter: nftables: Fix a memleak from userdata error path in new objects
can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
can: mcp251x: fix resume from sleep before interface was brought up
can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
sched: Fix out-of-bound access in uclamp
sched/fair: Fix unfairness caused by missing load decay
fs/proc/generic.c: fix incorrect pde_is_permanent check
kernel: kexec_file: fix error return code of kexec_calculate_store_digests()
kernel/resource: make walk_system_ram_res() find all busy IORESOURCE_SYSTEM_RAM resources
kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources
netfilter: nftables: avoid overflows in nft_hash_buckets()
i40e: fix broken XDP support
i40e: Fix use-after-free in i40e_client_subtask()
i40e: fix the restart auto-negotiation after FEC modified
i40e: Fix PHY type identifiers for 2.5G and 5G adapters
mptcp: fix splat when closing unaccepted socket
f2fs: avoid unneeded data copy in f2fs_ioc_move_range()
ARC: entry: fix off-by-one error in syscall number validation
ARC: mm: PAE: use 40-bit physical page mask
ARC: mm: Use max_high_pfn as a HIGHMEM zone border
powerpc/64s: Fix crashes when toggling stf barrier
powerpc/64s: Fix crashes when toggling entry flush barrier
hfsplus: prevent corruption in shrinking truncate
squashfs: fix divide error in calculate_skip()
userfaultfd: release page in error path to avoid BUG_ON
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
blk-iocost: fix weight updates of inner active iocgs
arm64: mte: initialize RGSR_EL1.SEED in __cpu_setup
arm64: Fix race condition on PG_dcache_clean in __sync_icache_dcache()
btrfs: fix race leading to unpersisted data and metadata on fsync
drm/radeon/dpm: Disable sclk switching on Oland when two 4K 60Hz monitors are connected
drm/amd/display: Initialize attribute for hdcp_srm sysfs file
drm/i915: Avoid div-by-zero on gen2
kvm: exit halt polling on need_resched() as well
KVM: LAPIC: Accurately guarantee busy wait for timer to expire when using hv_timer
drm/msm/dp: initialize audio_comp when audio starts
KVM: x86: Cancel pvclock_gtod_work on module removal
KVM: x86: Prevent deadlock against tk_core.seq
dax: Add an enum for specifying dax wakup mode
dax: Add a wakeup mode parameter to put_unlocked_entry()
dax: Wake up all waiters after invalidating dax entry
xen/unpopulated-alloc: consolidate pgmap manipulation
xen/unpopulated-alloc: fix error return code in fill_list()
perf tools: Fix dynamic libbpf link
usb: dwc3: gadget: Free gadget structure only after freeing endpoints
iio: light: gp2ap002: Fix rumtime PM imbalance on error
iio: proximity: pulsedlight: Fix rumtime PM imbalance on error
iio: hid-sensors: select IIO_TRIGGERED_BUFFER under HID_SENSOR_IIO_TRIGGER
usb: fotg210-hcd: Fix an error message
hwmon: (occ) Fix poll rate limiting
usb: musb: Fix an error message
ACPI: scan: Fix a memory leak in an error handling path
kyber: fix out of bounds access when preempted
nvmet: add lba to sect conversion helpers
nvmet: fix inline bio check for bdev-ns
nvmet-rdma: Fix NULL deref when SEND is completed with error
f2fs: compress: fix to free compress page correctly
f2fs: compress: fix race condition of overwrite vs truncate
f2fs: compress: fix to assign cc.cluster_idx correctly
nbd: Fix NULL pointer in flush_workqueue
blk-mq: plug request for shared sbitmap
blk-mq: Swap two calls in blk_mq_exit_queue()
usb: dwc3: omap: improve extcon initialization
usb: dwc3: pci: Enable usb2-gadget-lpm-disable for Intel Merrifield
usb: xhci: Increase timeout for HC halt
usb: dwc2: Fix gadget DMA unmap direction
usb: core: hub: fix race condition about TRSMRCY of resume
usb: dwc3: gadget: Enable suspend events
usb: dwc3: gadget: Return success always for kick transfer in ep queue
usb: typec: ucsi: Retrieve all the PDOs instead of just the first 4
usb: typec: ucsi: Put fwnode in any case during ->probe()
xhci-pci: Allow host runtime PM as default for Intel Alder Lake xHCI
xhci: Do not use GFP_KERNEL in (potentially) atomic context
xhci: Add reset resume quirk for AMD xhci controller.
iio: gyro: mpu3050: Fix reported temperature value
iio: tsl2583: Fix division by a zero lux_val
cdc-wdm: untangle a circular dependency between callback and softint
xen/gntdev: fix gntdev_mmap() error exit path
KVM: x86: Emulate RDPID only if RDTSCP is supported
KVM: x86: Move RDPID emulation intercept to its own enum
KVM: nVMX: Always make an attempt to map eVMCS after migration
KVM: VMX: Do not advertise RDPID if ENABLE_RDTSCP control is unsupported
KVM: VMX: Disable preemption when probing user return MSRs
Revert "iommu/vt-d: Remove WO permissions on second-level paging entries"
Revert "iommu/vt-d: Preset Access/Dirty bits for IOVA over FL"
iommu/vt-d: Preset Access/Dirty bits for IOVA over FL
iommu/vt-d: Remove WO permissions on second-level paging entries
mm: fix struct page layout on 32-bit systems
MIPS: Reinstate platform `__div64_32' handler
MIPS: Avoid DIVU in `__div64_32' is result would be zero
MIPS: Avoid handcoded DIVU in `__div64_32' altogether
clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue
clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940
ARM: 9011/1: centralize phys-to-virt conversion of DT/ATAGS address
ARM: 9012/1: move device tree mapping out of linear region
ARM: 9020/1: mm: use correct section size macro to describe the FDT virtual address
ARM: 9027/1: head.S: explicitly map DT even if it lives in the first physical section
usb: typec: tcpm: Fix error while calculating PPS out values
kobject_uevent: remove warning in init_uevent_argv()
drm/i915/gt: Fix a double free in gen8_preallocate_top_level_pdp
drm/i915: Read C0DRB3/C1DRB3 as 16 bits again
drm/i915/overlay: Fix active retire callback alignment
drm/i915: Fix crash in auto_retire
clk: exynos7: Mark aclk_fsys1_200 as critical
media: rkvdec: Remove of_match_ptr()
i2c: mediatek: Fix send master code at more than 1MHz
dt-bindings: media: renesas,vin: Make resets optional on R-Car Gen1
dt-bindings: serial: 8250: Remove duplicated compatible strings
debugfs: Make debugfs_allow RO after init
ext4: fix debug format string warning
nvme: do not try to reconfigure APST when the controller is not live
ASoC: rsnd: check all BUSIF status when error
Linux 5.10.38
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia32e01283b488a38be48015c58a0e481f09aaf65
[ Upstream commit c89a384e2551c692a9fe60d093fd7080f50afc51 ]
When removing rmap_item from stable tree, STABLE_FLAG of rmap_item is
cleared with head reserved. So the following scenario might happen: For
ksm page with rmap_item1:
cmp_and_merge_page
stable_node->head = &migrate_nodes;
remove_rmap_item_from_tree, but head still equal to stable_node;
try_to_merge_with_ksm_page failed;
return;
For the same ksm page with rmap_item2, stable node migration succeed this
time. The stable_node->head does not equal to migrate_nodes now. For ksm
page with rmap_item1 again:
cmp_and_merge_page
stable_node->head != &migrate_nodes && rmap_item->head == stable_node
return;
We would miss the rmap_item for stable_node and might result in failed
rmap_walk_ksm(). Fix this by set rmap_item->head to NULL when rmap_item
is removed from stable tree.
Link: https://lkml.kernel.org/r/20210330140228.45635-5-linmiaohe@huawei.com
Fixes: 4146d2d673 ("ksm: make !merge_across_nodes migration safe")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
KSM uses follow_page with FOLL_GET at several places so they should
use put_user_page to avoid page_pinner false positive.
Bug: 183414571
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0234eeb5db7d801e70c4884146c3029582b715c1
The :c:type:`foo` only works properly with structs before
Sphinx 3.x.
On Sphinx 3.x, structs should now be declared using the
.. c:struct, and referenced via :c:struct tag.
As we now have the automarkup.py macro, that automatically
convert:
struct foo
into cross-references, let's get rid of that, solving
several warnings when building docs with Sphinx 3.x.
Reviewed-by: André Almeida <andrealmeid@collabora.com> # blk-mq.rst
Reviewed-by: Takashi Iwai <tiwai@suse.de> # sound
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Patch series "mm: fixes to past from future testing".
Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.
This patch (of 5):
In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().
This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".
But there is a another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated.
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on. Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find? A
question not easily answered: put in the fix for missed memcg charges.
Cc; Matthew Wilcox <willy@infradead.org>
Fixes: 4c6355b25e ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org> [5.8]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge emailed patches from Peter Xu:
"This is a small series that I picked up from Linus's suggestion to
simplify cow handling (and also make it more strict) by checking
against page refcounts rather than mapcounts.
This makes uffd-wp work again (verified by running upmapsort)"
Note: this is horrendously bad timing, and making this kind of
fundamental vm change after -rc3 is not at all how things should work.
The saving grace is that it really is a a nice simplification:
8 files changed, 29 insertions(+), 120 deletions(-)
The reason for the bad timing is that it turns out that commit
17839856fd ("gup: document and work around 'COW can break either way'
issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
Patocka also reports that it caused issues for strace when running in a
DAX environment with ext4 on a persistent memory setup.
And we can't just revert that commit without re-introducing the original
issue that is a potential security hole, so making COW stricter (and in
the process much simpler) is a step to then undoing the forced COW that
broke other uses.
Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/
* emailed patches from Peter Xu <peterx@redhat.com>:
mm: Add PGREUSE counter
mm/gup: Remove enfornced COW mechanism
mm/ksm: Remove reuse_ksm_page()
mm: do_wp_page() simplification
Remove the function as the last reference has gone away with the do_wp_page()
changes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Michael Ellerman:
- Add support for (optionally) using queued spinlocks & rwlocks.
- Support for a new faster system call ABI using the scv instruction on
Power9 or later.
- Drop support for the PROT_SAO mmap/mprotect flag as it will be
unsupported on Power10 and future processors, leaving us with no way
to implement the functionality it requests. This risks breaking
userspace, though we believe it is unused in practice.
- A bug fix for, and then the removal of, our custom stack expansion
checking. We now allow stack expansion up to the rlimit, like other
architectures.
- Remove the remnants of our (previously disabled) topology update
code, which tried to react to NUMA layout changes on virtualised
systems, but was prone to crashes and other problems.
- Add PMU support for Power10 CPUs.
- A change to our signal trampoline so that we don't unbalance the link
stack (branch return predictor) in the signal delivery path.
- Lots of other cleanups, refactorings, smaller features and so on as
usual.
Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.
* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
selftests/powerpc: Fix pkey syscall redefinitions
powerpc: Fix circular dependency between percpu.h and mmu.h
powerpc/powernv/sriov: Fix use of uninitialised variable
selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
powerpc/40x: Fix assembler warning about r0
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
cpuidle: pseries: Fixup exit latency for CEDE(0)
cpuidle: pseries: Add function to parse extended CEDE records
cpuidle: pseries: Set the latency-hint before entering CEDE
selftests/powerpc: Fix online CPU selection
powerpc/perf: Consolidate perf_callchain_user_[64|32]()
powerpc/pseries/hotplug-cpu: Remove double free in error path
powerpc/pseries/mobility: Add pr_debug() for device tree changes
powerpc/pseries/mobility: Set pr_fmt()
powerpc/cacheinfo: Warn if cache object chain becomes unordered
powerpc/cacheinfo: Improve diagnostics about malformed cache lists
powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
powerpc/cacheinfo: Set pr_fmt()
powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
...
ISA v3.1 does not support the SAO storage control attribute required to
implement PROT_SAO. PROT_SAO was used by specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so removal should not cause problems.
We rather remove it than keep support for older processors, because
live migrating guest partitions to newer processors may not be possible
if SAO is in use (or worse allowed with silent races).
- PROT_SAO stays in the uapi header so code using it would still build.
- arch_validate_prot() is removed, the generic version rejects PROT_SAO
so applications would get a failure at mmap() time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop KVM change for the time being]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
Pull more KVM updates from Paolo Bonzini:
- PPC secure guest support
- small x86 cleanup
- fix for an x86-specific out-of-bounds write on a ioctl (not guest
triggerable, data not attacker-controlled)
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: vmx: Stop wasting a page for guest_msrs
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
Documentation: kvm: Fix mention to number of ioctls classes
powerpc: Ultravisor: Add PPC_UV config option
KVM: PPC: Book3S HV: Support reset of secure guest
KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM
KVM: PPC: Book3S HV: Radix changes for secure guest
KVM: PPC: Book3S HV: Shared pages support for secure guests
KVM: PPC: Book3S HV: Support for running secure guests
mm: ksm: Export ksm_madvise()
KVM x86: Move kvm cpuid support out of svm
On PEF-enabled POWER platforms that support running of secure guests,
secure pages of the guest are represented by device private pages
in the host. Such pages needn't participate in KSM merging. This is
achieved by using ksm_madvise() call which need to be exported
since KVM PPC can be a kernel module.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
remove_stable_node() when it races with __mmput() and squeezes in
between ksm_exit() and exit_mmap().
WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
Call Trace:
remove_all_stable_nodes+0x12b/0x330
run_store+0x4ef/0x7b0
kernfs_fop_write+0x200/0x420
vfs_write+0x154/0x450
ksys_write+0xf9/0x1d0
do_syscall_64+0x99/0x510
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Remove the warning as there is nothing scary going on.
Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
Fixes: cbf86cfe04 ("ksm: remove old stable nodes more thoroughly")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).
Users of mmu notifier API track changes to the CPU page table and take
specific action for them. While current API only provide range of virtual
address affected by the change, not why the changes is happening.
This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.
The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens. A latter patch do convert mm call path to use a more appropriate
events for each call.
This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:
%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }
@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%
Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place
Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e. KSM page rmap
walk. Basically it is not a big deal since commit 2c653d0ee2 ("ksm:
introduce ksm_max_page_sharing per page deduplication limit"), since
max_page_sharing limits the number of shared KSM pages.
But it still sounds not worth waiting for the lock, the page can be
skip, then try to merge it in the next scan to avoid potential stall if
its content is still intact.
Introduce trylock mode to get_ksm_page() to not block on page lock, like
what try_to_merge_one_page() does. And, define three possible
operations (nolock, lock and trylock) as enum type to avoid stacking up
bools and make the code more readable.
Return -EBUSY if trylock fails, since NULL means not find suitable KSM
page, which is a valid case.
With the default max_page_sharing setting (256), there is almost no
observed change comparing lock vs trylock.
However, with ksm02 of LTP, the reduced ksmd full scan time can be
observed, which has set max_page_sharing to 786432. With lock version,
ksmd may tak 10s - 11s to run two full scans, with trylock version ksmd
may take 8s - 11s to run two full scans. And, the number of
pages_sharing and pages_to_scan keep same. Basically, this change has
no harm.
[hughd@google.com: fix BUG_ON()]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages. If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.
[ Note that we do not consider PageSwapCache() pages at least for now,
since we don't want to complicate __get_ksm_page(), which has nice
optimization based on this (for the migration case). Currenly it is
spinning on PageSwapCache() pages, waiting for when they have
unfreezed counters (i.e., for the migration finish). But we don't want
to make it also spinning on swap cache pages, which we try to reuse,
since there is not a very high probability to reuse them. So, for now
we do not consider PageSwapCache() pages at all. ]
So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree. Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing. After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.
[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.
1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"
This patch (of 2):
At present there are multiple places where invalid node number is
encoded as -1. Even though implicitly understood it is always better to
have macros in there. Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE. This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksm thread unconditionally sleeps in ksm_scan_thread() after each
iteration:
schedule_timeout_interruptible(
msecs_to_jiffies(ksm_thread_sleep_millisecs))
The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.
In case of user writes a big value by a mistake, and the thread enters
into schedule_timeout_interruptible(), it's not possible to cancel the
sleep by writing a new smaler value; the thread is just sleeping till
timeout expires.
The patch fixes the problem by waking the thread each time after the value
is updated.
This also may be useful for debug purposes; and also for userspace
daemons, which change sleep_millisecs value in dependence of system load.
Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace jhash2 with xxhash.
Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c hash() 12081 MB/s
ksm: xxh64 hash() 8770 MB/s
ksm: xxh32 hash() 4529 MB/s
ksm: jhash2 hash() 1569 MB/s
Sioh Lee did some testing:
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns
As jhash2 always will be slower (for data size like PAGE_SIZE). Don't use
it in ksm at all.
Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.
Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit cafa0010cd ("Raise the minimum required gcc version to 4.6")
recently exposed a brittle part of the build for supporting non-gcc
compilers.
Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
__GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
added compiler specific checks for __clang__ or __INTEL_COMPILER.
This is brittle, as they happened to get compatibility by posing as a
certain version of GCC. This broke when upgrading the minimal version
of GCC required to build the kernel, to a version above what ICC and
Clang claim to be.
Rather than always including compiler-gcc.h then undefining or
redefining macros in compiler-intel.h or compiler-clang.h, let's
separate out the compiler specific macro definitions into mutually
exclusive headers, do more proper compiler detection, and keep shared
definitions in compiler_types.h.
Fixes: cafa0010cd ("Raise the minimum required gcc version to 4.6")
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Eli Friedman <efriedma@codeaurora.org>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/
VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations. In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.
Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.
DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().
This patch has been tested against ndctl unit test. It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.
Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In our armv8a server(QDF2400), I noticed lots of WARN_ON caused by
PAGE_SIZE unaligned for rmap_item->address under memory pressure
tests(start 20 guests and run memhog in the host).
WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
Call trace:
kvm_age_hva_handler+0xc0/0xc8
handle_hva_to_gpa+0xa8/0xe0
kvm_age_hva+0x4c/0xe8
kvm_mmu_notifier_clear_flush_young+0x54/0x98
__mmu_notifier_clear_flush_young+0x6c/0xa0
page_referenced_one+0x154/0x1d8
rmap_walk_ksm+0x12c/0x1d0
rmap_walk+0x94/0xa0
page_referenced+0x194/0x1b0
shrink_page_list+0x674/0xc28
shrink_inactive_list+0x26c/0x5b8
shrink_node_memcg+0x35c/0x620
shrink_node+0x100/0x430
do_try_to_free_pages+0xe0/0x3a8
try_to_free_pages+0xe4/0x230
__alloc_pages_nodemask+0x564/0xdc0
alloc_pages_vma+0x90/0x228
do_anonymous_page+0xc8/0x4d0
__handle_mm_fault+0x4a0/0x508
handle_mm_fault+0xf8/0x1b0
do_page_fault+0x218/0x4b8
do_translation_fault+0x90/0xa0
do_mem_abort+0x68/0xf0
el0_da+0x24/0x28
In rmap_walk_ksm, the rmap_item->address might still have the
STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
on arm64.
This patch fixes it by ignoring (not removing) the low bits of address
when doing rmap_walk_ksm.
IMO, it should be backported to stable tree. the storm of WARN_ONs is
very easy for me to reproduce. More than that, I watched a panic (not
reproducible) as follows:
page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
flags: 0x1fffc00000000000()
raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
page dumped because: nonzero _refcount
CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
Hardware name: <snip for confidential issues>
Call trace:
dump_backtrace+0x0/0x22c
show_stack+0x24/0x2c
dump_stack+0x8c/0xb0
bad_page+0xf4/0x154
free_pages_check_bad+0x90/0x9c
free_pcppages_bulk+0x464/0x518
free_hot_cold_page+0x22c/0x300
__put_page+0x54/0x60
unmap_stage2_range+0x170/0x2b4
kvm_unmap_hva_handler+0x30/0x40
handle_hva_to_gpa+0xb0/0xec
kvm_unmap_hva_range+0x5c/0xd0
I even injected a fault on purpose in kvm_unmap_hva_range by seting
size=size-0x200, the call trace is similar as above. So I thought the
panic is similarly caused by the root cause of WARN_ON.
Andrea said:
: It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
: zap those bits and hide the misalignment caused by the low metadata
: bits being erroneously left set in the address, but the arm code
: notices when that's the last page in the memslot and the hva_end is
: getting aligned and the size is below one page.
:
: I think the problem triggers in the addr += PAGE_SIZE of
: unmap_stage2_ptes that never matches end because end is aligned but
: addr is not.
:
: } while (pte++, addr += PAGE_SIZE, addr != end);
:
: x86 again only works on hva_start/hva_end after converting it to
: gfn_start/end and that being in pfn units the bits are zapped before
: they risk to cause trouble.
Jia He said:
: I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without
: this patch, the WARN_ON is very easy for reproducing. After this patch, I
: have run the same benchmarch for a whole day without any WARN_ONs
Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Jia He <hejianet@gmail.com>
Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing comment provides a good overview of KSM implementation. Let's
update it to reflect recent additions of "chain" and "dup" variants of the
stable tree nodes and mark it as "DOC:" for inclusion into the KSM
documentation.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Mike Rapoport says:
These patches convert files in Documentation/vm to ReST format, add an
initial index and link it to the top level documentation.
There are no contents changes in the documentation, except few spelling
fixes. The relatively large diffstat stems from the indentation and
paragraph wrapping changes.
I've tried to keep the formatting as consistent as possible, but I could
miss some places that needed markup and add some markup where it was not
necessary.
[jc: significant conflicts in vm/hmm.rst]
This patch fixes a corner case for KSM. When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans
The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)
The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page. This means that
split_huge_page will find a value of the reference counter too high and
will fail.
This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them. If so, the
hugepage is split safely. This means that the hugepage is not split if
not necessary.
Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ADI is a new feature supported on SPARC M7 and newer processors to allow
hardware to catch rogue accesses to memory. ADI is supported for data
fetches only and not instruction fetches. An app can enable ADI on its
data pages, set version tags on them and use versioned addresses to
access the data pages. Upper bits of the address contain the version
tag. On M7 processors, upper four bits (bits 63-60) contain the version
tag. If a rogue app attempts to access ADI enabled data pages, its
access is blocked and processor generates an exception. Please see
Documentation/sparc/adi.txt for further details.
This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
MCD (Memory Corruption Detection) on selected memory ranges, enable
TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
version tags on page swap out/in or migration. ADI is not enabled by
default for any task. A task must explicitly enable ADI on a memory
range and set version tag for ADI to be effective for the task.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users. Everyone else is unaffected
by it.
When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range). But that notification is not necessary
in all cases.
This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end. It also adds documentation in all
those cases explaining why.
Below is a more in depth analysis of why this is fine to do this:
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
device use thing like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space). There is only 2 cases
when you need to notify those secondary TLB while holding page table
lock when clearing a pte/pmd:
A) page backing address is free before mmu_notifier_invalidate_range_end
B) a page table entry is updated to point to a new page (COW, write fault
on zero page, __replace_page(), ...)
Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.
Case B is more subtle. For correctness it requires the following sequence
to happen:
- take page table lock
- clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
- set page table entry to point to new page
If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.
Consider the following scenario (device use a feature similar to ATS/
PASID):
Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).
[Time N] -----------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA. This break total memory
ordering for the device.
When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback
to mmu_notifier_invalidate_range_end() outside the page table lock. This
is true even if the thread doing page table update is preempted right
after releasing page table lock before calling
mmu_notifier_invalidate_range_end
Thanks to Andrea for thinking of a problematic scenario for COW.
[jglisse@redhat.com: v2]
Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>