Commit Graph

400 Commits

Author SHA1 Message Date
Bing Han
034877c195 ANDROID: mm: export swapcache_free_entries
Export swapcache_free_entries so it can be used by the alternative function
behind the vendor hook android_vh_drain_slots_cache_cpu to free swap entries
in the swap slot cache; its usage is similar to that in drain_slots_cache_cpu.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: Ia89b1728d540c5cc8995a939a918e12c23057266
2022-06-30 03:00:23 +00:00
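A minimal sketch of what this enables (the EXPORT_SYMBOL_GPL variant and the
handler prototype below are assumptions, not the actual vendor code): the
symbol is exported from mm/swapfile.c so a vendor module's hook handler can
call it, mirroring drain_slots_cache_cpu().

    /* mm/swapfile.c: directly after the function definition */
    EXPORT_SYMBOL_GPL(swapcache_free_entries);

    /* vendor module: hypothetical handler for the vendor hook */
    static void vendor_drain_slots_cache_cpu(void *data,
                                             swp_entry_t *slots, int nr)
    {
            /* free the entries collected from the per-cpu slot cache */
            swapcache_free_entries(slots, nr);
    }
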
Bing Han
06c2766cbc ANDROID: mm: export symbols used in vendor hook android_vh_get_swap_page()
Three symbols are exported for use in the vendor hook android_vh_get_swap_page:
1) check_cache_active, used to get a swap page from the specified
swap location; its usage is similar to that in get_swap_page.
2) scan_swap_map_slots, used to get a swap page from the specified swap device;
its usage is similar to get_swap_pages.
3) swap_alloc_cluster, used to get a swap page from the specified swap device;
its usage is similar to get_swap_pages.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: Ie24c5d32a16c7cb87905d034095ec8fb070dbe0f
2022-06-30 03:00:23 +00:00
Bing Han
4506bcbba5 ANDROID: mm: export swap_type_to_swap_info
The function swap_type_to_swap_info is exported to access the
swap_info_struct of the specified swap device, which is regarded as
reserved extended memory.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: I0107e7d561150f1945a4c161e886e9e03383fff6
2022-06-30 03:00:23 +00:00
Bing Han
ed2b11d639 ANDROID: vendor_hook: Add hook in si_swapinfo()
Provide a vendor hook android_vh_si_swapinf to replace the process of
updating nr_to_be_unused. When a page is swapped to the specified swap
location, nr_to_be_unused should not be updated, because the specified
swap device is regarded as reserved extended memory.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: Ie41caec345658589bf908fb0f96d038d1fba21f3
2022-06-30 03:00:23 +00:00
Bing Han
667f0d71dc ANDROID: vendor_hooks: Add hooks to extend the struct swap_info_struct
Two vendor hooks are added to extend struct swap_info_struct:
android_vh_alloc_si extends the allocation of struct swap_info_struct,
adding data that records information about the specified reclaim location;
android_vh_init_swap_info_struct initializes that extension of
struct swap_info_struct.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: I0e1d8e38ba7dfd52b609b1c14eb78f8b0ef0f9e6
2022-06-30 03:00:23 +00:00
Bing Han
bc4c73c182 ANDROID: vendor_hook: Add hooks in unuse_pte_range() and try_to_unuse()
When a page is unused, the vendor hook android_vh_unuse_swap_page
should be called to indicate that the page must no longer be swapped
to the specified swap location.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: I3fc3675020517f7cc69c76a06150dfb2380dae21
2022-06-30 03:00:23 +00:00
Bing Han
d2fea0ba9a ANDROID: vendor_hook: Add hook to update nr_swap_pages and total_swap_pages
The specified swap device is regarded as reserved extended memory,
so nr_swap_pages and total_swap_pages should not be affected by it.

Provide a vendor hook android_vh_account_swap_pages to replace the
updating of nr_swap_pages and total_swap_pages. When a page is swapped
to the specified swap location, nr_swap_pages and total_swap_pages
should not be updated.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: Ib8dfb355d190399a037b9d9eda478a81c436e224
2022-06-30 03:00:23 +00:00
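The usual GKI pattern for such a replacement hook is a skip flag decided by
the vendor handler at the call site (a sketch only; the exact hook arguments
are assumptions):

    bool skip = false;

    /* let the vendor handler claim the device as reserved memory */
    trace_android_vh_account_swap_pages(si, &skip);
    if (!skip) {
            atomic_long_add(nr_entries, &nr_swap_pages);
            total_swap_pages += nr_entries;
    }
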
Liujie Xie
d9845e9e5c ANDROID: export walk_page_range and swp_swap_info
Export walk_page_range and swp_swap_info for reading swap
from backing device to zram.

Bug: 225273514
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: If888cfc2823d8003b62bdb177740643696cf6f7e
2022-03-29 17:40:45 +00:00
Greg Kroah-Hartman
b1a6760ddf Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits:

dd139186ef ANDROID: usb: gadget: fix NULL pointer dereference in android_setup
07f65598af ANDROID: GKI: Disable kmem cgroup accounting
309aa7e7a2 FROMLIST: mm, memcg: inline swap-related functions to improve disabled memcg config
3ae8e2f183 BACKPORT: FROMLIST: mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config
f73d029485 FROMLIST: mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions
669df367a9 UPSTREAM: mm/memcg: bail early from swap accounting if memcg disabled
1f0c32a667 UPSTREAM: procfs/dmabuf: add inode number to /proc/*/fdinfo
0c8c125f57 UPSTREAM: procfs: allow reading fdinfo with PTRACE_MODE_READ
2e0476a465 Revert "FROMLIST: procfs: Allow reading fdinfo with PTRACE_MODE_READ"
5ded961aa2 Revert "FROMLIST: BACKPORT: procfs/dmabuf: Add inode number to /..."
3ee5565017 UPSTREAM: f2fs: initialize page->private when using for our internal use
dba79c3af3 ANDROID: mm: page_pinner: report test_page_isolation_failure
13362ab28e ANDROID: mm: page_pinner: add state of page_pinner
3254948484 ANDROID: mm: page_pinner: add more struct page fields
0445b67bee ANDROID: mm: page_pinner: change timestamp format
71da06728c ANDROID: mm: page_pinner: print_page_pinner refactoring
b83e564914 ANDROID: mm: page_pinner: remove shared_count
849f048050 ANDROID: mm: page_pinner: remove WARN_ON_ONCE
9a453100fc ANDROID: mm: page_pinner: fix typos
d012783a86 ANDROID: mm: page_pinner: reset migration failed page
470cce5085 ANDROID: mm: page_pinner: record every put_page
9f47e5fdda ANDROID: mm: page_pinner: change function names
a8385d61f2 ANDROID: Allow vendor module to reclaim a memcg
f41a95eadc ANDROID: Export memcg functions to allow module to add new files
46bf3b94e7 FROMGIT: dt-bindings: usb: dwc3: Update dwc3 TX fifo properties
b36b813e39 UPSTREAM: dt-bindings: usb: Convert DWC USB3 bindings to DT schema
9a80b7b728 FROMGIT: of: Add stub for of_add_property()
2742be5903 ANDROID: fips140: define fips_enabled to 1 to enable FIPS behavior
e886dd4c33 ANDROID: fips140: unregister existing DRBG algorithms
634445a640 ANDROID: fips140: fix deadlock in unregister_existing_fips140_algos()
0af06624ea ANDROID: fips140: check for errors from initcalls
92de53472e ANDROID: fips140: log already-live algorithms
0a7da21583 ANDROID: Update new mtk gki symbol
98085b5dd8 ANDROID: usb: Add vendor hook for usb suspend and resume
956db89e71 BACKPORT: FROMLIST: dma-heap: Let dma heap use dma_map_attrs to map & unmap iova
749d6e7f2c ANDROID: abi_gki_aarch64_qcom: Add vendor hook for shmem_alloc_page
b05bbe48be ANDROID: abi_gki_aarch64_qcom: Add reclaim_shmem_address_space
d80c70d7a8 ANDROID: android: export kernel function arch_mmap_rnd
25c7eb4932 ANDROID: mm: shmem: Fix build break with allnoconfig
1cdcf76b15 ANDROID: vendor_hooks: add hooks in mem_cgroup subsystem
726468dd4a ANDROID: GKI: add vendor padding variable in struct skb_shared_info
fc79c93657 FROMLIST: scsi: ufs: add quirk to enable host controller without interface configuration
2d5ae6b787 FROMLIST: scsi: ufs: add quirk to handle broken UIC command
38abaebab7 ANDROID: syscall_check: add vendor hook for bpf syscall
a7a3b31d58 ANDROID: syscall_check: add vendor hook for open syscall
a5543c9cd7 ANDROID: syscall_check: add vendor hook for mmap syscall
1f0769279f ANDROID: GKI: Add symbol to symbol list
2cff74e08c ANDROID: vendor_hooks: Add vendor hook to the net
25edba0d4d FROMLIST: scsi: ufs: Fix the SCSI abort handler
c0efdc4a5e ANDROID: android: export kernel function vm_unmapped_area
964220d080 ANDROID: shmem: vendor hook in shmem_alloc_page
bd2ca0ba5b FROMLIST: pstore/ram: Rework logic for detecting ramoops reserved memory region
daeabfe7fa ANDROID: mm: add reclaim_shmem_address_space() for faster reclaims
4c3dddf408 ANDROID: Update the generic ABI symbol list
4c4d8cbdef ANDROID: GKI: refresh ABI XML
01e4a037d8 ANDROID: GKI: turn on TIDY_ABI
edf973fd24 ANDROID: Update symbol list for VIVO
1702d2c8b7 FROMGIT: net: cdc_ncm: switch to eth%d interface naming
f4d6e8324c ANDROID: GKI: add allowed GKI symbol for Exynosauto SoC
444a0b7752 ANDROID: mm: add vendor hook for vmpressure
c799c6644b ANDROID: fips140: adjust some log messages
091338cb39 ANDROID: fips140: add missing static keyword to fips140_init()
70bfd6a7e0 ANDROID: GKI: update allowed list for exynosauto SoC
3e3147b280 UPSTREAM: scsi: ufs: ufshcd: Fix some function doc-rot
2c553e754f UPSTREAM: scsi: ufs: Adjust ufshcd_hold() during sending attribute requests
52ccdf90b9 FROMLIST: lockdep: Remove console_verbose when disable lock debugging
4458494476 ANDROID: ABI: qcom: Add symbols for 80211
5c51579fde ANDROID: fork: Export task_newtask tracepoint
e2a90797e8 ANDROID: Fix kernelci warnings for indentation in smp.c
bac33eaebf ANDROID: irqchip: gic-v3: Move struct gic_chip_data to header
bdac4418bf ANDROID: abi_gki_aarch64_qcom: Add android_vh_ufs_clock_scaling
65c1de0f06 ANDROID: Update symbol list for mtk
d4d02ab9b0 UPSTREAM: swiotlb: manipulate orig_addr when tlb_addr has offset
58aa0f2832 ANDROID: qcom: Add net related symbol
2f9f816445 ANDROID: Update the exynos symbol list
b2a9471239 ANDROID: Update symbol list for mtk
7c9599e204 FROMGIT: usb: dwc3: Create helper function getting MDWIDTH
0a24affb86 ANDROID: vendor_hooks: modify the function name
d686d5ffc6 ANDROID: GKI: Add some symbols to symbol list
bdfb11230b ANDROID: cpuidle: Allow for an early exit from cpuidle_enter_state()
f0b280c395 ANDROID: cpuidle: Update cpuidle_uninstall_idle_handler() to wakeup all online CPUs
14dd90ab37 ANDROID: scsi: ufs: Add hook to influence the UFS clock scaling policy
00aec39e2e FROMGIT: bpf: Support all gso types in bpf_skb_change_proto()

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6e5be89f3f02c420237a549f4c6a08b5ed434581
2021-07-13 15:00:50 +02:00
Suren Baghdasaryan
309aa7e7a2 FROMLIST: mm, memcg: inline swap-related functions to improve disabled memcg config
Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and
cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static
key check inline before calling the main body of the function. This
minimizes the memcg overhead in the pagefault and exit_mmap paths when
memcgs are disabled using cgroup_disable=memory command-line option.
This change results in ~1% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory}
configuration on an 8-core ARM64 Android device.

[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>

Link: https://lore.kernel.org/patchwork/patch/1458908/
Bug: 191223209
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I18d59090ec908037b39324d1f1bb511d06e9c690
2021-07-12 18:34:30 -07:00
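The shape of the change is the usual static-inline wrapper that performs the
static-key check before calling the out-of-line body (a sketch; the
double-underscore body name is assumed):

    static inline int mem_cgroup_try_charge_swap(struct page *page,
                                                 swp_entry_t entry)
    {
            /* disabled-memcg configs pay only this branch */
            if (mem_cgroup_disabled())
                    return 0;
            return __mem_cgroup_try_charge_swap(page, entry);
    }
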
Suren Baghdasaryan
f73d029485 FROMLIST: mm, memcg: add mem_cgroup_disabled checks in vmpressure and swap-related functions
Add mem_cgroup_disabled check in vmpressure, mem_cgroup_uncharge_swap and
cgroup_throttle_swaprate functions. This minimizes the memcg overhead in
the pagefault and exit_mmap paths when memcgs are disabled using
cgroup_disable=memory command-line option.
This change results in ~2.1% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n, CONFIG_MEMCG_SWAP=n} against {CONFIG_MEMCG=y,
CONFIG_MEMCG_SWAP=y, cgroup_disable=memory} configuration on an 8-core
ARM64 Android device.

[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>

Link: https://lore.kernel.org/patchwork/patch/1458906/
Bug: 191223209
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic1fc75eb1e4d7a9848cf641b9f232ad3262c490b
2021-07-12 18:26:15 -07:00
Greg Kroah-Hartman
948d38f94d Merge 5.10.46 into android12-5.10-lts
Changes in 5.10.46
	dmaengine: idxd: add missing dsa driver unregister
	dmaengine: fsl-dpaa2-qdma: Fix error return code in two functions
	dmaengine: xilinx: dpdma: initialize registers before request_irq
	dmaengine: ALTERA_MSGDMA depends on HAS_IOMEM
	dmaengine: QCOM_HIDMA_MGMT depends on HAS_IOMEM
	dmaengine: SF_PDMA depends on HAS_IOMEM
	dmaengine: stedma40: add missing iounmap() on error in d40_probe()
	afs: Fix an IS_ERR() vs NULL check
	mm/memory-failure: make sure wait for page writeback in memory_failure
	kvm: LAPIC: Restore guard to prevent illegal APIC register access
	fanotify: fix copy_event_to_user() fid error clean up
	batman-adv: Avoid WARN_ON timing related checks
	mac80211: fix skb length check in ieee80211_scan_rx()
	mlxsw: reg: Spectrum-3: Enforce lowest max-shaper burst size of 11
	mlxsw: core: Set thermal zone polling delay argument to real value at init
	libbpf: Fixes incorrect rx_ring_setup_done
	net: ipv4: fix memory leak in netlbl_cipsov4_add_std
	vrf: fix maximum MTU
	net: rds: fix memory leak in rds_recvmsg
	net: dsa: felix: re-enable TX flow control in ocelot_port_flush()
	net: lantiq: disable interrupt before sheduling NAPI
	netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
	ice: add ndo_bpf callback for safe mode netdev ops
	ice: parameterize functions responsible for Tx ring management
	udp: fix race between close() and udp_abort()
	rtnetlink: Fix regression in bridge VLAN configuration
	net/sched: act_ct: handle DNAT tuple collision
	net/mlx5e: Remove dependency in IPsec initialization flows
	net/mlx5e: Fix page reclaim for dead peer hairpin
	net/mlx5: Consider RoCE cap before init RDMA resources
	net/mlx5: DR, Allow SW steering for sw_owner_v2 devices
	net/mlx5: DR, Don't use SW steering when RoCE is not supported
	net/mlx5e: Block offload of outer header csum for UDP tunnels
	netfilter: synproxy: Fix out of bounds when parsing TCP options
	mptcp: Fix out of bounds when parsing TCP options
	sch_cake: Fix out of bounds when parsing TCP options and header
	mptcp: try harder to borrow memory from subflow under pressure
	mptcp: do not warn on bad input from the network
	selftests: mptcp: enable syncookie only in absence of reorders
	alx: Fix an error handling path in 'alx_probe()'
	cxgb4: fix endianness when flashing boot image
	cxgb4: fix sleep in atomic when flashing PHY firmware
	cxgb4: halt chip before flashing PHY firmware image
	net: stmmac: dwmac1000: Fix extended MAC address registers definition
	net: make get_net_ns return error if NET_NS is disabled
	net: qualcomm: rmnet: Update rmnet device MTU based on real device
	net: qualcomm: rmnet: don't over-count statistics
	ethtool: strset: fix message length calculation
	qlcnic: Fix an error handling path in 'qlcnic_probe()'
	netxen_nic: Fix an error handling path in 'netxen_nic_probe()'
	cxgb4: fix wrong ethtool n-tuple rule lookup
	ipv4: Fix device used for dst_alloc with local routes
	net: qrtr: fix OOB Read in qrtr_endpoint_post
	bpf: Fix leakage under speculation on mispredicted branches
	ptp: improve max_adj check against unreasonable values
	net: cdc_ncm: switch to eth%d interface naming
	lantiq: net: fix duplicated skb in rx descriptor ring
	net: usb: fix possible use-after-free in smsc75xx_bind
	net: fec_ptp: fix issue caused by refactor the fec_devtype
	net: ipv4: fix memory leak in ip_mc_add1_src
	net/af_unix: fix a data-race in unix_dgram_sendmsg / unix_release_sock
	net/mlx5: E-Switch, Read PF mac address
	net/mlx5: E-Switch, Allow setting GUID for host PF vport
	net/mlx5: Reset mkey index on creation
	be2net: Fix an error handling path in 'be_probe()'
	net: hamradio: fix memory leak in mkiss_close
	net: cdc_eem: fix tx fixup skb leak
	cxgb4: fix wrong shift.
	bnxt_en: Rediscover PHY capabilities after firmware reset
	bnxt_en: Fix TQM fastpath ring backing store computation
	bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
	icmp: don't send out ICMP messages with a source address of 0.0.0.0
	net: ethernet: fix potential use-after-free in ec_bhf_remove
	regulator: cros-ec: Fix error code in dev_err message
	regulator: bd70528: Fix off-by-one for buck123 .n_voltages setting
	platform/x86: thinkpad_acpi: Add X1 Carbon Gen 9 second fan support
	ASoC: rt5659: Fix the lost powers for the HDA header
	phy: phy-mtk-tphy: Fix some resource leaks in mtk_phy_init()
	ASoC: fsl-asoc-card: Set .owner attribute when registering card.
	regulator: rtmv20: Fix to make regcache value first reading back from HW
	spi: spi-zynq-qspi: Fix some wrong goto jumps & missing error code
	sched/pelt: Ensure that *_sum is always synced with *_avg
	ASoC: tas2562: Fix TDM_CFG0_SAMPRATE values
	spi: stm32-qspi: Always wait BUSY bit to be cleared in stm32_qspi_wait_cmd()
	regulator: rt4801: Fix NULL pointer dereference if priv->enable_gpios is NULL
	ASoC: rt5682: Fix the fast discharge for headset unplugging in soundwire mode
	pinctrl: ralink: rt2880: avoid to error in calls is pin is already enabled
	drm/sun4i: dw-hdmi: Make HDMI PHY into a platform device
	ASoC: qcom: lpass-cpu: Fix pop noise during audio capture begin
	radeon: use memcpy_to/fromio for UVD fw upload
	hwmon: (scpi-hwmon) shows the negative temperature properly
	mm: relocate 'write_protect_seq' in struct mm_struct
	irqchip/gic-v3: Workaround inconsistent PMR setting on NMI entry
	bpf: Inherit expanded/patched seen count from old aux data
	bpf: Do not mark insn as seen under speculative path verification
	can: bcm: fix infoleak in struct bcm_msg_head
	can: bcm/raw/isotp: use per module netdevice notifier
	can: j1939: fix Use-after-Free, hold skb ref while in use
	can: mcba_usb: fix memory leak in mcba_usb
	usb: core: hub: Disable autosuspend for Cypress CY7C65632
	usb: chipidea: imx: Fix Battery Charger 1.2 CDP detection
	tracing: Do not stop recording cmdlines when tracing is off
	tracing: Do not stop recording comms if the trace file is being read
	tracing: Do no increment trace_clock_global() by one
	PCI: Mark TI C667X to avoid bus reset
	PCI: Mark some NVIDIA GPUs to avoid bus reset
	PCI: aardvark: Fix kernel panic during PIO transfer
	PCI: Add ACS quirk for Broadcom BCM57414 NIC
	PCI: Work around Huawei Intelligent NIC VF FLR erratum
	KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
	KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU
	KVM: X86: Fix x86_emulator slab cache leak
	s390/mcck: fix calculation of SIE critical section size
	s390/ap: Fix hanging ioctl caused by wrong msg counter
	ARCv2: save ABI registers across signal handling
	x86/mm: Avoid truncating memblocks for SGX memory
	x86/process: Check PF_KTHREAD and not current->mm for kernel threads
	x86/ioremap: Map EFI-reserved memory as encrypted for SEV
	x86/pkru: Write hardware init value to PKRU when xstate is init
	x86/fpu: Prevent state corruption in __fpu__restore_sig()
	x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
	x86/fpu: Reset state for all signal restore failures
	crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
	dmaengine: pl330: fix wrong usage of spinlock flags in dma_cyclc
	mac80211: Fix NULL ptr deref for injected rate info
	cfg80211: make certificate generation more robust
	cfg80211: avoid double free of PMSR request
	drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell.
	drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue.
	net: ll_temac: Make sure to free skb when it is completely used
	net: ll_temac: Fix TX BD buffer overwrite
	net: bridge: fix vlan tunnel dst null pointer dereference
	net: bridge: fix vlan tunnel dst refcnt when egressing
	mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
	mm/slub: clarify verification reporting
	mm/slub: fix redzoning for small allocations
	mm/slub: actually fix freelist pointer vs redzoning
	mm/slub.c: include swab.h
	net: stmmac: disable clocks in stmmac_remove_config_dt()
	net: fec_ptp: add clock rate zero check
	tools headers UAPI: Sync linux/in.h copy with the kernel sources
	perf beauty: Update copy of linux/socket.h with the kernel sources
	usb: dwc3: debugfs: Add and remove endpoint dirs dynamically
	usb: dwc3: core: fix kernel panic when do reboot
	Linux 5.10.46

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I99f37c9f257f90ccdb091306f3d4cfb7c32e3880
2021-06-23 17:53:08 +02:00
Peter Xu
12eb3c2c1a mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
commit 099dd6878b9b12d6bbfa6bf29ce0c8ddd38f6901 upstream.

I found by pure code review that pte_same_as_swp() in unuse_vma()
didn't take the uffd-wp bit into account when comparing ptes.
pte_same_as_swp() returning a false negative could cause failure to
swapoff swap ptes that were wr-protected by userfaultfd.

Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[5.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23 14:42:53 +02:00
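The fix strips both swap-pte marker bits before comparing, roughly as follows
(a sketch; helper names assumed to match the upstream patch):

    static inline pte_t pte_swp_clear_flags(pte_t pte)
    {
            if (pte_swp_soft_dirty(pte))
                    pte = pte_swp_clear_soft_dirty(pte);
            if (pte_swp_uffd_wp(pte))
                    pte = pte_swp_clear_uffd_wp(pte);
            return pte;
    }

    /* a wr-protected swap pte still matches its entry during swapoff */
    static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
    {
            return pte_same(pte_swp_clear_flags(pte), swp_pte);
    }
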
Greg Kroah-Hartman
28454baf9c Merge 5.10.21 into android12-5.10
Changes in 5.10.21
	net: usb: qmi_wwan: support ZTE P685M modem
	Input: elantech - fix protocol errors for some trackpoints in SMBus mode
	Input: elan_i2c - add new trackpoint report type 0x5F
	drm/virtio: use kvmalloc for large allocations
	x86/build: Treat R_386_PLT32 relocation as R_386_PC32
	JFS: more checks for invalid superblock
	sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled
	udlfb: Fix memory leak in dlfb_usb_probe
	media: mceusb: sanity check for prescaler value
	erofs: fix shift-out-of-bounds of blkszbits
	media: v4l2-ctrls.c: fix shift-out-of-bounds in std_validate
	xfs: Fix assert failure in xfs_setattr_size()
	net/af_iucv: remove WARN_ONCE on malformed RX packets
	smackfs: restrict bytes count in smackfs write functions
	tomoyo: ignore data race while checking quota
	net: fix up truesize of cloned skb in skb_prepare_for_shift()
	riscv: Get rid of MAX_EARLY_MAPPING_SIZE
	nbd: handle device refs for DESTROY_ON_DISCONNECT properly
	mm/hugetlb.c: fix unnecessary address expansion of pmd sharing
	RDMA/rtrs: Do not signal for heatbeat
	RDMA/rtrs-clt: Use bitmask to check sess->flags
	RDMA/rtrs-srv: Do not signal REG_MR
	tcp: fix tcp_rmem documentation
	mptcp: do not wakeup listener for MPJ subflows
	net: bridge: use switchdev for port flags set through sysfs too
	net/sched: cls_flower: Reject invalid ct_state flags rules
	net: dsa: tag_rtl4_a: Support also egress tags
	net: ag71xx: remove unnecessary MTU reservation
	net: hsr: add support for EntryForgetTime
	net: psample: Fix netlink skb length with tunnel info
	net: fix dev_ifsioc_locked() race condition
	dt-bindings: ethernet-controller: fix fixed-link specification
	dt-bindings: net: btusb: DT fix s/interrupt-name/interrupt-names/
	ASoC: qcom: Remove useless debug print
	rsi: Fix TX EAPOL packet handling against iwlwifi AP
	rsi: Move card interrupt handling to RX thread
	EDAC/amd64: Do not load on family 0x15, model 0x13
	staging: fwserial: Fix error handling in fwserial_create
	x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
	vt/consolemap: do font sum unsigned
	wlcore: Fix command execute failure 19 for wl12xx
	Bluetooth: hci_h5: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for btrtl
	Bluetooth: btusb: fix memory leak on suspend and resume
	mt76: mt7615: reset token when mac_reset happens
	pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
	ath10k: fix wmi mgmt tx queue full due to race condition
	net: sfp: add mode quirk for GPON module Ubiquiti U-Fiber Instant
	Bluetooth: Add new HCI_QUIRK_NO_SUSPEND_NOTIFIER quirk
	Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
	staging: most: sound: add sanity check for function argument
	staging: bcm2835-audio: Replace unsafe strcpy() with strscpy()
	brcmfmac: Add DMI nvram filename quirk for Predia Basic tablet
	brcmfmac: Add DMI nvram filename quirk for Voyo winpad A15 tablet
	drm/hisilicon: Fix use-after-free
	crypto: tcrypt - avoid signed overflow in byte count
	fs: make unlazy_walk() error handling consistent
	drm/amdgpu: Add check to prevent IH overflow
	PCI: Add a REBAR size quirk for Sapphire RX 5600 XT Pulse
	ASoC: Intel: bytcr_rt5640: Add new BYT_RT5640_NO_SPEAKERS quirk-flag
	drm/amd/display: Guard against NULL pointer deref when get_i2c_info fails
	drm/amd/amdgpu: add error handling to amdgpu_virt_read_pf2vf_data
	media: uvcvideo: Allow entities with no pads
	f2fs: handle unallocated section and zone on pinned/atgc
	f2fs: fix to set/clear I_LINKABLE under i_lock
	nvme-core: add cancel tagset helpers
	nvme-rdma: add clean action for failed reconnection
	nvme-tcp: add clean action for failed reconnection
	ASoC: Intel: Add DMI quirk table to soc_intel_is_byt_cr()
	btrfs: fix error handling in commit_fs_roots
	perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]
	ASoC: Intel: sof-sdw: indent and add quirks consistently
	ASoC: Intel: sof_sdw: detect DMIC number based on mach params
	parisc: Bump 64-bit IRQ stack size to 64 KB
	sched/features: Fix hrtick reprogramming
	ASoC: Intel: bytcr_rt5640: Add quirk for the Estar Beauty HD MID 7316R tablet
	ASoC: Intel: bytcr_rt5640: Add quirk for the Voyo Winpad A15 tablet
	ASoC: Intel: bytcr_rt5651: Add quirk for the Jumper EZpad 7 tablet
	ASoC: Intel: bytcr_rt5640: Add quirk for the Acer One S1002 tablet
	scsi: iscsi: Restrict sessions and handles to admin capabilities
	scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE
	scsi: iscsi: Verify lengths on passthrough PDUs
	Xen/gnttab: handle p2m update errors on a per-slot basis
	xen-netback: respect gnttab_map_refs()'s return value
	xen: fix p2m size in dom0 for disabled memory hotplug case
	zsmalloc: account the number of compacted pages correctly
	remoteproc/mediatek: Fix kernel test robot warning
	swap: fix swapfile read/write offset
	powerpc/sstep: Check instruction validity against ISA version before emulation
	powerpc/sstep: Fix incorrect return from analyze_instr()
	tty: fix up iterate_tty_read() EOVERFLOW handling
	tty: fix up hung_up_tty_read() conversion
	tty: clean up legacy leftovers from n_tty line discipline
	tty: teach n_tty line discipline about the new "cookie continuations"
	tty: teach the n_tty ICANON case about the new "cookie continuations" too
	media: v4l: ioctl: Fix memory leak in video_usercopy
	ALSA: hda/realtek: Add quirk for Clevo NH55RZQ
	ALSA: hda/realtek: Add quirk for Intel NUC 10
	ALSA: hda/realtek: Apply dual codec quirks for MSI Godlike X570 board
	net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0 workaround
	net: sfp: add workaround for Realtek RTL8672 and RTL9601C chips
	Linux 5.10.21

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I52b1105b73d893779b3886b577accfabe9f83a16
2021-03-07 12:53:30 +01:00
Jens Axboe
04b049ac9c swap: fix swapfile read/write offset
commit caf6912f3f4af7232340d500a4a2008f81b93f14 upstream.

We're not factoring in the start of the file for where to write and
read the swapfile, which leads to very unfortunate side effects of
writing where we should not be...

Fixes: dd6bd0d9c7 ("swap: use bdev_read_page() / bdev_write_page()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07 12:34:15 +01:00
Will Deacon
cab48b24a8 BACKPORT: FROMGIT: mm: Use static initialisers for immutable fields of 'struct vm_fault'
In preparation for const-ifying the anonymous struct field of
'struct vm_fault', ensure that it is initialised using designated
initialisers.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Change-Id: Ib2c84bbc4d59fe1811465e59c89f8eb7f73e6229
Bug: 171278850
(cherry picked from commit 8c63ca5bc3e19f11128e8e285dcf20aac6768f97
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/faultaround)
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2021-02-11 12:15:40 +00:00
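A representative call site after the conversion (sketch):

    struct vm_fault vmf = {
            .vma = vma,
            .address = address & PAGE_MASK,
            .flags = flags,
            .pgoff = linear_page_index(vma, address),
    };
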
Greg Kroah-Hartman
39564d70ad Merge 5.10.12 into android12-5.10
Changes in 5.10.12
	gpio: mvebu: fix pwm .get_state period calculation
	Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
	futex: Ensure the correct return value from futex_lock_pi()
	futex: Replace pointless printk in fixup_owner()
	futex: Provide and use pi_state_update_owner()
	rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
	futex: Use pi_state_update_owner() in put_pi_state()
	futex: Simplify fixup_pi_state_owner()
	futex: Handle faults correctly for PI futexes
	HID: wacom: Correct NULL dereference on AES pen proximity
	HID: multitouch: Apply MT_QUIRK_CONFIDENCE quirk for multi-input devices
	media: Revert "media: videobuf2: Fix length check for single plane dmabuf queueing"
	media: v4l2-subdev.h: BIT() is not available in userspace
	RDMA/vmw_pvrdma: Fix network_hdr_type reported in WC
	iwlwifi: dbg: Don't touch the tlv data
	kernel/io_uring: cancel io_uring before task works
	io_uring: inline io_uring_attempt_task_drop()
	io_uring: add warn_once for io_uring_flush()
	io_uring: stop SQPOLL submit on creator's death
	io_uring: fix null-deref in io_disable_sqo_submit
	io_uring: do sqo disable on install_fd error
	io_uring: fix false positive sqo warning on flush
	io_uring: fix uring_flush in exit_files() warning
	io_uring: fix skipping disabling sqo on exec
	io_uring: dont kill fasync under completion_lock
	io_uring: fix sleeping under spin in __io_clean_op
	objtool: Don't fail on missing symbol table
	mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint
	mm: fix a race on nr_swap_pages
	tools: Factor HOSTCC, HOSTLD, HOSTAR definitions
	printk: fix buffer overflow potential for print_text()
	printk: fix string termination for record_print_text()
	Linux 5.10.12

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6d96ec78494ebbc0daf4fdecfc13e522c6bd6b42
2021-01-30 14:29:02 +01:00
Zhaoyang Huang
f472a59aa1 mm: fix a race on nr_swap_pages
commit b50da6e9f42ade19141f6cf8870bb2312b055aa3 upstream.

The scenario in which "Free swap = -4kB" appears on my system is caused by
several get_swap_pages() calls racing with each other while
show_swap_cache_info() runs simultaneously. There is no need to add a lock
to get_swap_page_of_type(), as the "pre-subtract/post-add" pattern is removed here.

ProcessA			ProcessB			ProcessC
ngoals = 1			ngoals = 1
avail = nr_swap_pages(1)	avail = nr_swap_pages(1)
nr_swap_pages(1) -= ngoals
				nr_swap_pages(0) -= ngoals
								nr_swap_pages = -1

Link: https://lkml.kernel.org/r/1607050340-4535-1-git-send-email-zhaoyang.huang@unisoc.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
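A simplified sketch of the fixed flow in get_swap_page_of_type(): the counter
is only decremented once a slot has actually been allocated, so readers can no
longer observe a transiently negative value (scan_swap_map() stands in for the
internal slot scan):

    swp_entry_t get_swap_page_of_type(int type)
    {
            struct swap_info_struct *si = swap_type_to_swap_info(type);
            pgoff_t offset;

            if (!si)
                    return (swp_entry_t) {0};

            spin_lock(&si->lock);
            if (si->flags & SWP_WRITEOK) {
                    /* decrement only after a slot was really handed out */
                    offset = scan_swap_map(si, 1);
                    if (offset) {
                            atomic_long_dec(&nr_swap_pages);
                            spin_unlock(&si->lock);
                            return swp_entry(type, offset);
                    }
            }
            spin_unlock(&si->lock);
            return (swp_entry_t) {0};
    }
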
Vijayanand Jitta
5e07d2eb08 ANDROID: mm: Export si_swapinfo
Export si_swapinfo symbol which is used as part
of meminfo collection from minidump module.

Bug: 176277894
Change-Id: I5dc1672ce649c22dc33d4a544ee5a38f8376becf
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
2021-01-11 06:05:40 +00:00
Qian Cai
b11a76b37a mm/swapfile: do not sleep with a spin lock held
We can't call kvfree() with a spin lock held, so defer it.  Fixes a
might_sleep() runtime warning.

Fixes: 873d7bcfd0 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
Signed-off-by: Qian Cai <qcai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-06 10:19:07 -08:00
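The general shape of the fix is to detach the buffer under the lock and free
it only after the lock is dropped (a sketch of the pattern, not the exact diff):

    void *to_free;

    spin_lock(&si->lock);
    to_free = si->cluster_info;     /* detach under the lock */
    si->cluster_info = NULL;
    spin_unlock(&si->lock);

    kvfree(to_free);                /* kvfree() may sleep */
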
Miaohe Lin
822bca52ee mm/swapfile.c: fix potential memory leak in sys_swapon
If we failed to drain the inode, we would forget to free the swap address
space allocated by init_swap_address_space() above.

Fixes: dc617f29db ("vfs: don't allow writes to swap files")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:30 -07:00
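The fix adds the missing teardown on this error path; schematically it looks
like the following (a sketch only: the surrounding label and the
inode_drain_writes() step are assumptions about the sys_swapon() context):

    error = inode_drain_writes(inode);
    if (error) {
            inode->i_flags &= ~S_SWAPFILE;
            /* undo init_swap_address_space() before bailing out */
            exit_swap_address_space(p->type);
            goto bad_swap_unlock_inode;
    }
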
Miaohe Lin
7a3d52e45e mm/swapfile.c: remove unnecessary goto out in _swap_info_get()
It's unnecessary to goto the out label when the out label is just below.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:30 -07:00
Yu Zhao
cc2828b21c mm: remove activate_page() from unuse_pte()
We don't initially add anon pages to active lruvec after commit
b518154e59 ("mm/vmscan: protect the workingset on anonymous LRU").
Remove activate_page() from unuse_pte(), which seems to be missed by the
commit.  And make the function static while we are at it.

Before the commit, we called lru_cache_add_active_or_unevictable() to add
new ksm pages to active lruvec.  Therefore, activate_page() wasn't
necessary for them in the first place.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:30 -07:00
Gao Xiang
3264631548 swap: rename SWP_FS to SWP_FS_OPS to avoid ambiguity
SWP_FS is used to make swap_{read,write}page() go through the filesystem,
and it's only used for swap files over NFS for now.  Otherwise it will
directly submit IO to blockdev according to swapfile extents reported by
filesystems in advance.

As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
rename to SWP_FS_OPS.

[1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:29 -07:00
Linus Torvalds
3ad11d7ac8 Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, separating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Linus Torvalds
6734e20e39 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "There's quite a lot of code here, but much of it is due to the
  addition of a new PMU driver as well as some arm64-specific selftests
  which is an area where we've traditionally been lagging a bit.

  In terms of exciting features, this includes support for the Memory
  Tagging Extension which narrowly missed 5.9, hopefully allowing
  userspace to run with use-after-free detection in production on CPUs
  that support it. Work is ongoing to integrate the feature with KASAN
  for 5.11.

  Another change that I'm excited about (assuming they get the hardware
  right) is preparing the ASID allocator for sharing the CPU page-table
  with the SMMU. Those changes will also come in via Joerg with the
  IOMMU pull.

  We do stray outside of our usual directories in a few places, mostly
  due to core changes required by MTE. Although much of this has been
  Acked, there were a couple of places where we unfortunately didn't get
  any review feedback.

  Other than that, we ran into a handful of minor conflicts in -next,
  but nothing that should pose any issues.

  Summary:

   - Userspace support for the Memory Tagging Extension introduced by
     Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

   - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
     switching.

   - Fix and subsequent rewrite of our Spectre mitigations, including
     the addition of support for PR_SPEC_DISABLE_NOEXEC.

   - Support for the Armv8.3 Pointer Authentication enhancements.

   - Support for ASID pinning, which is required when sharing
     page-tables with the SMMU.

   - MM updates, including treating flush_tlb_fix_spurious_fault() as a
     no-op.

   - Perf/PMU driver updates, including addition of the ARM CMN PMU
     driver and also support to handle CPU PMU IRQs as NMIs.

   - Allow prefetchable PCI BARs to be exposed to userspace using normal
     non-cacheable mappings.

   - Implementation of ARCH_STACKWALK for unwinding.

   - Improve reporting of unexpected kernel traps due to BPF JIT
     failure.

   - Improve robustness of user-visible HWCAP strings and their
     corresponding numerical constants.

   - Removal of TEXT_OFFSET.

   - Removal of some unused functions, parameters and prototypes.

   - Removal of MPIDR-based topology detection in favour of firmware
     description.

   - Cleanups to handling of SVE and FPSIMD register state in
     preparation for potential future optimisation of handling across
     syscalls.

   - Cleanups to the SDEI driver in preparation for support in KVM.

   - Miscellaneous cleanups and refactoring work"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  Revert "arm64: initialize per-cpu offsets earlier"
  arm64: random: Remove no longer needed prototypes
  arm64: initialize per-cpu offsets earlier
  kselftest/arm64: Check mte tagged user address in kernel
  kselftest/arm64: Verify KSM page merge for MTE pages
  kselftest/arm64: Verify all different mmap MTE options
  kselftest/arm64: Check forked child mte memory accessibility
  kselftest/arm64: Verify mte tag inclusion via prctl
  kselftest/arm64: Add utilities and a test to validate mte memory
  perf: arm-cmn: Fix conversion specifiers for node type
  perf: arm-cmn: Fix unsigned comparison to less than zero
  arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
  arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
  arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
  arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
  KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
  arm64: Get rid of arm64_ssbd_state
  KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
  KVM: arm64: Get rid of kvm_arm_have_ssbd()
  KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
  ...
2020-10-12 10:00:51 -07:00
Gao Xiang
4166343058 mm, THP, swap: fix allocating cluster for swapfile by mistake
SWP_FS is used to make swap_{read,write}page() go through the
filesystem, and it's only used for swap files over NFS.  So !SWP_FS
means non-NFS for now; it could be either file-backed or device-backed.
Something similar goes for the legacy SWP_FILE.

So in order to achieve the goal of the original patch, SWP_BLKDEV should
be used instead.

FS corruption can be observed with SSD device + XFS + fragmented
swapfile due to CONFIG_THP_SWAP=y.

I reproduced the issue with the following details:

Environment:

  QEMU + upstream kernel + buildroot + NVMe (2 GB)

Kernel config:

  CONFIG_BLK_DEV_NVME=y
  CONFIG_THP_SWAP=y

Some reproducible steps:

  mkfs.xfs -f /dev/nvme0n1
  mkdir /tmp/mnt
  mount /dev/nvme0n1 /tmp/mnt
  bs="32k"
  sz="1024m"    # doesn't matter too much, I also tried 16m
  xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
  xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
  xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
  xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
  xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

  mkswap /tmp/mnt/sw
  swapon /tmp/mnt/sw

  stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well

Symptoms:
 - FS corruption (e.g. checksum failure)
 - memory corruption at: 0xd2808010
 - segfault

Fixes: f0eea189e8 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26 10:33:57 -07:00
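The crux of the fix is testing SWP_BLKDEV rather than !SWP_FS when deciding
whether a whole huge cluster may be allocated for a THP swap-out. A hedged
sketch of that check (the helper wrapper below is illustrative, not code from
the patch):

    /* Only real block-device swap may hand out a contiguous huge
     * cluster; a fragmented (non-NFS) swapfile must still be mapped
     * through its extents, so !SWP_FS is not a safe test here. */
    static bool swap_may_alloc_huge_cluster(struct swap_info_struct *si)
    {
            return IS_ENABLED(CONFIG_THP_SWAP) && (si->flags & SWP_BLKDEV);
    }
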
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
is also writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
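After this change a driver that needs stable pages sets a queue flag, and a
filesystem that needs them for its own checksumming sets a superblock flag;
roughly (a sketch: flag names follow the commit's description, so treat the
exact identifiers as assumptions):

    /* block driver: pages under writeback must not be modified */
    blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);

    /* filesystem doing its own checksumming: request it per superblock */
    sb->s_iflags |= SB_I_STABLE_WRITES;
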
Christoph Hellwig
a8b456d01c bdi: remove BDI_CAP_SYNCHRONOUS_IO
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decide if ->rw_page can be used on a block device.  Just check for
the method instead.  The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
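On the swap side the capability bit is replaced by a direct check for the
->rw_page method at swapon time; roughly (a sketch, placement assumed):

    if (p->bdev && p->bdev->bd_disk->fops->rw_page)
            p->flags |= SWP_SYNCHRONOUS_IO;
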
Christoph Hellwig
21bd900572 mm: split swap_type_of
swap_type_of is used for two entirely different purposes:

 (1) check what swap type a given device/offset corresponds to
 (2) find the first available swap device that can be written to

Mixing both in a single function creates an unreadable mess.  Create two
separate functions instead, and switch both to pass a dev_t instead of
a struct block_device to further simplify the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:19 -06:00
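After the split, the two use cases get separate entry points, both keyed on a
dev_t (prototypes per the commit description; the name of the second helper is
an assumption):

    /* (1) which swap type does this device/offset correspond to? */
    int swap_type_of(dev_t device, sector_t offset);

    /* (2) first available swap device that can be written to */
    int find_first_swap(dev_t *device);
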
Christoph Hellwig
ef16e1d98c mm: cleanup claim_swapfile
Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:19 -06:00
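A sketch of the simplified open path in claim_swapfile() (error handling
abbreviated; the exact mode flags are assumptions):

    bdev = blkdev_get_by_dev(inode->i_rdev,
                             FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);
    if (IS_ERR(bdev))
            return PTR_ERR(bdev);
    p->bdev = bdev;
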
Steven Price
8a84802e2a mm: Add arch hooks for saving/restoring tags
Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to
every physical page; when swapping pages out to disk it is necessary to
save these tags and later restore them when reading the pages back.

Add some hooks along with dummy implementations to enable the
arch code to handle this.

Three new hooks are added to the swap code:
 * arch_prepare_to_swap() and
 * arch_swap_invalidate_page() / arch_swap_invalidate_area().
One new hook is added to shmem:
 * arch_swap_restore()

Signed-off-by: Steven Price <steven.price@arm.com>
[catalin.marinas@arm.com: add unlock_page() on the error path]
[catalin.marinas@arm.com: dropped the _tags suffix]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-09-04 12:46:07 +01:00
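The hooks default to no-ops so architectures without tags are unaffected; the
dummy definitions look roughly like this (a sketch; the guard macro names are
assumptions):

    #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
    static inline int arch_prepare_to_swap(struct page *page)
    {
            return 0;       /* nothing to save without tags */
    }
    #endif

    #ifndef __HAVE_ARCH_SWAP_RESTORE
    static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
    {
            /* no tags to restore */
    }
    #endif
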
Qian Cai
a449bf58e4 mm/swapfile: fix and annotate various data races
swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
be accessed concurrently separately as noticed by KCSAN,

=== si.highest_bit ===

 write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
  swap_range_alloc+0x81/0x130
  swap_range_alloc at mm/swapfile.c:681
  scan_swap_map_slots+0x371/0xb90
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0xf2/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
  scan_swap_map_slots+0x4a6/0xb90
  scan_swap_map_slots at mm/swapfile.c:892
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0xf2/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

=== si.swap_map[offset] ===

 write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
  __swap_entry_free_locked+0x8c/0x100
  __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
  __swap_entry_free.constprop.20+0x69/0xb0
  free_swap_and_cache+0x53/0xa0
  unmap_page_range+0x7f8/0x1d70
  unmap_single_vma+0xcd/0x170
  unmap_vmas+0x18b/0x220
  exit_mmap+0xee/0x220
  mmput+0x10e/0x270
  do_exit+0x59b/0xf40
  do_group_exit+0x8b/0x180

 read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
  _swap_info_get+0x81/0xa0
  _swap_info_get at mm/swapfile.c:1140
  free_swap_and_cache+0x40/0xa0
  unmap_page_range+0x7f8/0x1d70
  unmap_single_vma+0xcd/0x170
  unmap_vmas+0x18b/0x220
  exit_mmap+0xee/0x220
  mmput+0x10e/0x270
  do_exit+0x59b/0xf40
  do_group_exit+0x8b/0x180

=== si.flags ===

 write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
  scan_swap_map_slots+0x6fe/0xb50
  scan_swap_map_slots at mm/swapfile.c:887
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0x377/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
  _swap_info_get+0x41/0xa0
  __swap_info_get at mm/swapfile.c:1114
  put_swap_page+0x84/0x490
  __remove_mapping+0x384/0x5f0
  shrink_page_list+0xff1/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

The writes are under si->lock but the reads are not.  For si.highest_bit
and si.swap_map[offset], a data race could trigger logic bugs, so fix them
by using WRITE_ONCE() for the writes and READ_ONCE() for the reads, except
for those isolated reads that only compare against zero, where a data race
would cause no harm.  Annotate the latter as intentional data races using
the data_race() macro.

For si.flags, the readers are only interested in a single bit, where a
data race would cause no issue.

[cai@lca.pw: add a missing annotation for si->flags in memory.c]
  Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:57 -07:00
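The resulting annotation pattern, in sketch form (fragments, not a complete
function):

    /* writer, holding si->lock */
    WRITE_ONCE(si->highest_bit, offset - 1);

    /* lockless reader: the value is only used as a search hint */
    unsigned long hint = READ_ONCE(si->highest_bit);

    /* intentional, benign race: the value is only compared against zero */
    bool slot_free = !data_race(si->swap_map[offset]);
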
Matthew Wilcox (Oracle)
6c357848b4 mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Joonsoo Kim
3852f6768e mm/swapcache: support to handle the shadow entries
Workingset detection for anonymous pages will be implemented in the
following patch, and it requires storing the shadow entries in the
swapcache.  This patch implements the infrastructure to store the shadow
entry in the swapcache.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Joonsoo Kim
b518154e59 mm/vmscan: protect the workingset on anonymous LRU
In the current implementation, newly created or swapped-in anonymous pages
start on the active list.  Growing the active list results in rebalancing
the active/inactive lists, so old pages on the active list are demoted to
the inactive list.  Hence, pages on the active list aren't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list.  Numbers denote the number of
pages on active/inactive list (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.  As with the file LRU, newly created or
swapped-in anonymous pages will be inserted on the inactive list.  They are
promoted to the active list if enough references happen.  This simple
modification changes the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the size
of the inactive list but less than the size of the total (active + inactive).
To solve this potential issue, the following patch will apply workingset
detection similar to the one that's already applied to the file LRU.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:55 -07:00
Christoph Hellwig
e556f6ba10 block: remove the bd_queue field from struct block_device
Just use bd_disk->queue instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 08:08:20 -06:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
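In mm/swapfile.c this is a mechanical substitution of the rwsem calls, for
example (sketch):

    /* before */
    down_read(&mm->mmap_sem);
    /* ... walk the address space ... */
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    /* ... walk the address space ... */
    mmap_read_unlock(mm);
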
Mike Rapoport
e31cf2f4ca mm: don't include asm/pgtable.h if linux/mm.h is already included
Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always the
possibility to override the generic version with the usual ifdefs magic.
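
For example, such an override typically looks something like this (hedged
sketch of the usual pattern, not code from this series):

	/* arch header: provide a custom version and announce it */
	#define pmd_offset pmd_offset
	static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
	{
		/* arch-specific implementation */
	}

	/* generic header: only used if the arch did not define its own */
	#ifndef pmd_offset
	static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
	{
		return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
	}
	#endif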

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point in explicitly including <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Johannes Weiner
4c6355b25e mm: memcontrol: charge swapin pages on instantiation
Right now, users that are otherwise memory controlled can easily escape
their containment and allocate significant amounts of memory that they're
not being charged for.  That's because swap readahead pages are not being
charged until somebody actually faults them into their page table.  This
can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
allocations without charging the pages.

There are additional problems with the delayed charging of swap pages:

1. To implement refault/workingset detection for anonymous pages, we
   need to have a target LRU available at swapin time, but the LRU is not
   determinable until the page has been charged.

2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
   stable when the page is isolated from the LRU; otherwise, the locks
   change under us.  But the swapcache gets charged after it's already on
   the LRU, and we cannot even isolate it ourselves beforehand (since
   charging is not exactly optional).

The previous patch ensured we always maintain cgroup ownership records for
swap pages.  This patch moves the swapcache charging point from the fault
handler to swapin time to fix all of the above problems.
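
Roughly, swapin then charges the page as soon as it is instantiated,
before it can be mapped anywhere (hedged sketch; the real code's
ordering, locking and error handling are more involved, and the memcg is
picked via the swap ownership records from the previous patch):

	page = alloc_page_vma(gfp_mask, vma, addr);
	if (!page)
		return NULL;
	/* charge before the page can be faulted into any page table */
	if (mem_cgroup_charge(page, NULL, gfp_mask)) {
		put_page(page);
		return NULL;
	}
	/* ... add_to_swap_cache(), swap_readpage(), etc. ... */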

v2: simplify swapin error checking (Joonsoo)

[hughd@google.com: fix livelock in __read_swap_cache_async()]
  Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
9d82c69438 mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API
With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type.  All we need is a freshly allocated page and a
memcg context to charge.
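
A hedged sketch of the callsite simplification (argument lists
abbreviated; not a literal diff):

	/* old: three-step try/commit (or cancel) dance */
	if (mem_cgroup_try_charge(page, mm, gfp, &memcg))
		goto oom;
	/* ... set the page up ... */
	mem_cgroup_commit_charge(page, memcg, false);

	/* new: one call, right after the page is allocated */
	if (mem_cgroup_charge(page, mm, gfp))
		goto oom;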

v2: prevent double charges on pre-allocated hugepages in khugepaged

[hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
  Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
be5d0a74c6 mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter.  This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.
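
The converted accounting sites then have roughly this shape (hedged
sketch modeled on the existing NR_FILE_MAPPED handling, not the literal
rmap code):

	lock_page_memcg(page);		/* keeps page->mem_cgroup stable */
	if (atomic_inc_and_test(&page->_mapcount))
		__mod_lruvec_page_state(page, NR_ANON_MAPPED,
					hpage_nr_pages(page));
	unlock_page_memcg(page);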

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time.  However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
6caa6a0703 mm: memcontrol: move out cgroup swaprate throttling
The cgroup swaprate throttling is about matching new anon allocations to
the rate of available IO when that is being throttled.  It's the io
controller hooking into the VM, rather than a memory controller thing.

Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and
drop the @memcg argument which is only used to check whether the preceding
page charge has succeeded and the fault is proceeding.

We could decouple the call from mem_cgroup_try_charge() here as well, but
that would cause unnecessary churn: the following patches convert all
callsites to a new charge API and we'll decouple as we go along.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
3fba69a56e mm: memcontrol: drop @compound parameter from memcg charging API
The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging.  The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists.  This makes for cryptic code and is a
breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along.  This is safe because charging completes
before the page is published, i.e. before anybody could split it.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally.  That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.
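
A hedged before/after at a typical callsite (illustrative, not copied
from the diff):

	-	mem_cgroup_commit_charge(page, memcg, false, false);
	+	mem_cgroup_commit_charge(page, memcg, false);

	/* inside memcg, the size is now inferred rather than passed in */
	unsigned int nr_pages = hpage_nr_pages(page);	/* 1 or HPAGE_PMD_NR */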

The following patches will introduce a new charging API, best not to carry
over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Randy Dunlap
6f7939405f mm: swapfile: fix /proc/swaps heading and Size/Used/Priority alignment
Fix the heading and Size/Used/Priority field alignments in /proc/swaps.
If the Size and/or Used value is >= 10000000 (8 or more characters), then
the tab-based alignment is broken.

This patch maintains the use of tabs for alignment.  If spaces are
preferred, we can just use a Field Width specifier for the bytes and
inuse fields.  That way those fields don't have to be a multiple of 8
bytes in width.  E.g., with a field width of 12, both Size and Used
would always fit on the first line of an 80-column wide terminal (only
Priority would be on the second line).

There are actually 2 problems: heading alignment and field width.  On an
xterm, if Used is 7 characters long, the tab does nothing, and the
display looks like this, with no space/tab between the Used and Priority
fields.  (ugh)

Filename				Type		Size	Used	Priority
/dev/sda8                               partition	16779260	2023012-1

To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look
different, like so:

Filename				Type		Size	Used	Priority
/dev/sda8                               partition	16779260	2086988	-1
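
For reference, a small userspace sketch of the field-width alternative
mentioned above (the widths are made up for the example):

	#include <stdio.h>

	int main(void)
	{
		/* tab alignment breaks once a value grows to 8 characters */
		printf("%s\t\t%u\t%u\t%d\n", "partition", 16779260, 2023012, -1);
		/* explicit field widths stay aligned regardless of magnitude */
		printf("%-40s %-10s %12u %12u %5d\n",
		       "/dev/sda8", "partition", 16779260, 2023012, -1);
		return 0;
	}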

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00
Huang Ying
4907058881 swap: reduce lock contention on swap cache from swap slots allocation
In some swap scalability tests, it is found that there is heavy lock
contention on the swap cache even though we have split the single swap
cache radix tree per swap device into one radix tree per 64 MB trunk in
commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes fragmented so
that there are no free swap clusters, the swap device will be scanned
linearly to find free swap slots.  swap_info_struct->cluster_next is
the next scanning base and is shared by all CPUs, so nearby free swap
slots will be allocated to different CPUs.  The probability that
multiple CPUs operate on the same 64 MB trunk is therefore high, which
causes the lock contention on the swap cache.

To solve the issue, in this patch, for SSD swap devices, a per-CPU
next scanning base (cluster_next_cpu) is added.  Every CPU will use its
own per-CPU next scanning base, and after finishing scanning a 64MB
trunk, the per-CPU scanning base will be moved to the beginning of
another randomly selected 64MB trunk.  In this way, the probability
that multiple CPUs operate on the same 64 MB trunk is greatly reduced,
and thus the lock contention is reduced too.  For HDDs, because
sequential access is more important for IO performance, the original
shared next scanning base is kept.
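
A hedged sketch of the idea (the helper names below are made up for the
illustration; only cluster_next_cpu comes from the description above):

	/* each CPU scans from its own base instead of a shared cluster_next */
	unsigned long prev = this_cpu_read(*si->cluster_next_cpu);
	unsigned long offset = scan_for_free_slot(si, prev);	/* hypothetical */

	/* once the scan leaves the current 64MB trunk, restart this CPU's
	 * base in a randomly chosen trunk so that CPUs spread out again
	 */
	if (trunk_index(offset) != trunk_index(prev))		/* hypothetical */
		offset = random_trunk_start(si);		/* hypothetical */

	this_cpu_write(*si->cluster_next_cpu, offset);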

To test the patch, we have run 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  One ram disk is configured as the
swap device per socket.  The pmbench working-set size is much larger than
the available memory so that swapping is triggered.  The memory read/write
ratio is 80/20 and the accessing pattern is random.  In the original
implementation, the lock contention on the swap cache is heavy.  The perf
profiling data of the lock contention code path is as follows,

 _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.91
 _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:               7.11
 _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     1.66
 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.29
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.03
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        0.93

After applying this patch, it becomes,

 _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      2.3
 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap:     2.26
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:        1.8
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:         1.19

The lock contention on the swap cache is almost eliminated.

The pmbench score increases by 18.5%.  The swapin throughput increases
by 18.7% from 2.96 GB/s to 3.51 GB/s, while the swapout throughput
increases by 18.5% from 2.99 GB/s to 3.54 GB/s.

We need a really fast disk to show the benefit.  I have tried this on 2
Intel P3600 NVMe disks; the performance improvement is only about 1%.
The improvement should be better on faster disks, such as an Intel
Optane disk.

[ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel]
  Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com
[ying.huang@intel.com: v4]
  Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00
Huang Ying
09fe06ce0b mm/swapfile.c: use prandom_u32_max()
Use prandom_u32_max() to improve code readability and take advantage of
the common implementation.
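
A hedged illustration of the kind of conversion this enables (not the
literal diff):

	-	offset = 1 + prandom_u32() % si->highest_bit;
	+	offset = 1 + prandom_u32_max(si->highest_bit);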

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00
Wei Yang
33e16272fe mm/swapfile.c: __swap_entry_free() always free 1 entry
__swap_entry_free() always frees 1 entry.  Let's remove its usage argument.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200501015259.32237-2-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00
Huang Ying
ed43af1097 swap: try to scan more free slots even when fragmented
Currently, the scalability of the swap code drops sharply when the swap
device becomes fragmented, because the swap slot allocation batching
stops working.  To solve the problem, in this patch we try to scan a few
more swap slots, with restricted effort, so that swap slot allocation can
still be batched even if the swap device is fragmented.  Tests show that
the benchmark score can increase by up to 37.1% with the patch.  Details
are as follows.

The swap code has a per-cpu cache of swap slots.  These batch swap space
allocations to improve swap subsystem scaling.  In the following code
path,

  add_to_swap()
    get_swap_page()
      refill_swap_slots_cache()
        get_swap_pages()
	  scan_swap_map_slots()

scan_swap_map_slots() and get_swap_pages() can return multiple swap
slots for each call.  These slots will be cached in the per-CPU swap
slots cache, so that several following swap slot requests will be
fulfilled there to avoid the lock contention in the lower level swap
space allocation/freeing code path.

But this only works when there are free swap clusters.  If a swap device
becomes so fragmented that there are no free swap clusters,
scan_swap_map_slots() and get_swap_pages() will return only one swap
slot per call in the above code path.  Effectively, this falls back to
the situation before the swap slots cache was introduced, and the heavy
lock contention on the swap related locks kills the scalability.

Why does it work this way?  Because the swap device could be large and
the free swap slot scanning could be quite time consuming, the
conservative method was used to avoid spending too much time scanning
for free swap slots.

In fact, this can be improved by scanning a few more free slots with
strictly restricted effort, which is what this patch implements.  In
scan_swap_map_slots(), after the first free swap slot is found, we will
try to scan a little more, but only if we haven't already scanned too
many slots (< LATENCY_LIMIT).  That is, the added scanning latency is
strictly restricted.
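
A hedged sketch of the restricted extra scanning (the loop shape and the
swap_offset_free() helper are illustrative; LATENCY_LIMIT and the
n_ret/n_goal style follow the existing scan_swap_map_slots() code):

	/* we already have one slot; keep going, but never probe more than
	 * LATENCY_LIMIT offsets on this call
	 */
	while (n_ret < n_goal && ++scanned < LATENCY_LIMIT) {
		offset++;
		if (!swap_offset_free(si, offset))	/* hypothetical helper */
			continue;
		slots[n_ret++] = swp_entry(si->type, offset);
	}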

To test the patch, we have run 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  Multiple ram disks are
configured as the swap devices.  The pmbench working-set size is much
larger than the available memory so that swapping is triggered.  The
memory read/write ratio is 80/20 and the accessing pattern is random, so
the swap space becomes highly fragmented during the test.  In the
original implementation, the lock contention on swap related locks is
very heavy.  The perf profiling data of the lock contention code path is
as follows,

 _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:             21.03
 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    1.92
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      1.72
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       0.69

After applying this patch, it becomes,

 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:    4.89
 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:      3.85
 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:       1.1
 _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

That is, the lock contention on the swap locks is eliminated.

The pmbench score increases by 37.1%.  The swapin throughput increases
by 45.7% from 2.02 GB/s to 2.94 GB/s, while the swapout throughput
increases by 45.3% from 2.04 GB/s to 2.97 GB/s.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00
Wei Yang
7b9e2de130 mm/swapfile.c: omit a duplicate code by compare tmp and max first
There are two duplicated pieces of code handling the case when no swap
entry is available.  To avoid this, we can compare tmp and max first and
let the second guard do its job.

No functional change is expected.
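
In pattern form, the change looks roughly like this (hedged sketch, not
the function's actual code):

	/* before: two identical "no free entry" cleanups */
	if (tmp >= max)
		goto no_entry;
	while (tmp < max && !slot_is_free(tmp))	/* hypothetical helper */
		tmp++;
	if (tmp >= max)
		goto no_entry;

	/* after: do the tmp/max comparison first and let the single
	 * remaining guard handle the no-entry case on its own
	 */
	if (tmp < max) {
		while (tmp < max && !slot_is_free(tmp))
			tmp++;
	}
	if (tmp >= max)
		goto no_entry;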

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:09 -07:00