android_kernel_xiaomi_sm8450

xiaomi-sm8450/android_kernel_xiaomi_sm8450

Author	SHA1	Message	Date
Blagovest Kolenichev	e79e029826	Merge android-5.4.5 (9cdc723) into msm-5.4 * refs/heads/tmp-9cdc723: Revert "usb: dwc3: gadget: Fix logical condition" Revert "FROMLIST: scsi: ufs-qcom: Adjust bus bandwidth voting and unvoting" Linux 5.4.5 r8169: add missing RX enabling for WoL on RTL8125 net: mscc: ocelot: unregister the PTP clock on deinit ionic: keep users rss hash across lif reset xdp: obtain the mem_id mutex before trying to remove an entry. page_pool: do not release pool until inflight == 0. net/mlx5e: ethtool, Fix analysis of speed setting net/mlx5e: Fix translation of link mode into speed net/mlx5e: Fix freeing flow with kfree() and not kvfree() net/mlx5e: Fix SFF 8472 eeprom length act_ct: support asymmetric conntrack net/mlx5e: Fix TXQ indices to be sequential net: Fixed updating of ethertype in skb_mpls_push() hsr: fix a NULL pointer dereference in hsr_dev_xmit() Fixed updating of ethertype in function skb_mpls_pop gre: refetch erspan header from skb->data after pskb_may_pull() cls_flower: Fix the behavior using port ranges with hw-offload net: sched: allow indirect blocks to bind to clsact in TC net: core: rename indirect block ingress cb function tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() tcp: tighten acceptance of ACKs not matching a child socket tcp: fix rejected syncookies due to stale timestamps net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup net: ipv6: add net argument to ip6_dst_lookup_flow net/mlx5e: Query global pause state before setting prio2buffer tipc: fix ordering of tipc module init and exit routine tcp: md5: fix potential overestimation of TCP option space openvswitch: support asymmetric conntrack net/tls: Fix return values to avoid ENOTSUPP net: thunderx: start phy before starting autonegotiation net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add() net: sched: fix dump qlen for sch_mq/sch_mqprio with NOLOCK subqueues net: ethernet: ti: cpsw: fix extra rx interrupt net: dsa: fix flow dissection on Tx path net: bridge: deny dev_set_mac_address() when unregistering mqprio: Fix out-of-bounds access in mqprio_dump inet: protect against too small mtu values. ANDROID: add initial ABI whitelist for android-5.4 ANDROID: abi update for 5.4.4 ANDROID: mm: Throttle rss_stat tracepoint FROMLIST: vsprintf: Inline call to ptr_to_hashval UPSTREAM: rss_stat: Add support to detect RSS updates of external mm UPSTREAM: mm: emit tracepoint when RSS changes Linux 5.4.4 EDAC/ghes: Do not warn when incrementing refcount on 0 r8169: fix rtl_hw_jumbo_disable for RTL8168evl workqueue: Fix missing kfree(rescuer) in destroy_workqueue() blk-mq: make sure that line break can be printed ext4: fix leak of quota reservations ext4: fix a bug in ext4_wait_for_tail_page_commit splice: only read in as much information as there is pipe buffer space rtc: disable uie before setting time and enable after USB: dummy-hcd: increase max number of devices to 32 powerpc: Define arch_is_kernel_initmem_freed() for lockdep mm/shmem.c: cast the type of unmap_start to u64 s390/kaslr: store KASLR offset for early dumps s390/smp,vdso: fix ASCE handling firmware: qcom: scm: Ensure 'a0' status code is treated as signed ext4: work around deleting a file with i_nlink == 0 safely mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction mfd: rk808: Fix RK818 ID template mm, memfd: fix COW issue on MAP_PRIVATE and F_SEAL_FUTURE_WRITE mappings powerpc: Fix vDSO clock_getres() powerpc: Avoid clang warnings around setjmp and longjmp omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251 omap: pdata-quirks: revert pandora specific gpiod additions iio: ad7949: fix channels mixups iio: ad7949: kill pointless "readback"-handling code Revert "scsi: qla2xxx: Fix memory leak when sending I/O fails" scsi: qla2xxx: Fix a dma_pool_free() call scsi: qla2xxx: Fix SRB leak on switch command timeout reiserfs: fix extended attributes on the root directory ext4: Fix credit estimate for final inode freeing quota: fix livelock in dquot_writeback_dquots seccomp: avoid overflow in implicit constant conversion ext2: check err when partial != NULL quota: Check that quota is not dirty before release video/hdmi: Fix AVI bar unpack powerpc/xive: Skip ioremap() of ESB pages for LSI interrupts powerpc: Allow flush_icache_range to work across ranges >4GB powerpc/xive: Prevent page fault issues in the machine crash handler powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB coresight: Serialize enabling/disabling a link device. stm class: Lose the protocol driver when dropping its reference ppdev: fix PPGETTIME/PPSETTIME ioctls RDMA/core: Fix ib_dma_max_seg_size() ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init pinctrl: samsung: Fix device node refcount leaks in init code pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init pinctrl: samsung: Fix device node refcount leaks in Exynos wakeup controller init pinctrl: samsung: Add of_node_put() before return in error path pinctrl: armada-37xx: Fix irq mask access in armada_37xx_irq_set_type() pinctrl: rza2: Fix gpio name typos ACPI: PM: Avoid attaching ACPI PM domain to certain devices ACPI: EC: Rework flushing of pending work ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data() ACPI: OSL: only free map once in osl.c ACPI / hotplug / PCI: Allocate resources directly under the non-hotplug bridge ACPI: LPSS: Add dmi quirk for skipping _DEP check for some device-links ACPI: LPSS: Add LNXVIDEO -> BYT I2C1 to lpss_device_links ACPI: LPSS: Add LNXVIDEO -> BYT I2C7 to lpss_device_links ACPI / utils: Move acpi_dev_get_first_match_dev() under CONFIG_ACPI ALSA: hda/realtek - Line-out jack doesn't work on a Dell AIO ALSA: oxfw: fix return value in error path of isochronous resources reservation ALSA: fireface: fix return value in error path of isochronous resources reservation cpufreq: powernv: fix stack bloat and hard limit on number of CPUs PM / devfreq: Lock devfreq in trans_stat_show intel_th: pci: Add Tiger Lake CPU support intel_th: pci: Add Ice Lake CPU support intel_th: Fix a double put_device() in error path powerpc/perf: Disable trace_imc pmu drm/panfrost: Open/close the perfcnt BO perf tests: Fix out of bounds memory access erofs: zero out when listxattr is called with no xattr cpuidle: use first valid target residency as poll time cpuidle: teo: Fix "early hits" handling for disabled idle states cpuidle: teo: Consider hits and misses metrics of disabled states cpuidle: teo: Rename local variable in teo_select() cpuidle: teo: Ignore disabled idle states that are too deep cpuidle: Do not unset the driver if it is there already media: cec.h: CEC_OP_REC_FLAG_ values were swapped media: radio: wl1273: fix interrupt masking on release media: bdisp: fix memleak on release media: vimc: sen: remove unused kthread_sen field media: hantro: Fix picture order count table enable media: hantro: Fix motion vectors usage condition media: hantro: Fix s_fmt for dynamic resolution changes s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported ar5523: check NULL before memcpy() in ar5523_cmd() wil6210: check len before memcpy() calls cgroup: pids: use atomic64_t for pids->limit blk-mq: avoid sysfs buffer overflow with too many CPU cores md: improve handling of bio with REQ_PREFLUSH in md_flush_request() ASoC: fsl_audmix: Add spin lock to protect tdms ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report ASoC: rt5645: Fixed typo for buddy jack support. ASoC: rt5645: Fixed buddy jack support. workqueue: Fix pwq ref leak in rescuer_thread() workqueue: Fix spurious sanity check failures in destroy_workqueue() dm zoned: reduce overhead of backing device checks dm writecache: handle REQ_FUA hwrng: omap - Fix RNG wait loop timeout ovl: relax WARN_ON() on rename to self ovl: fix corner case of non-unique st_dev;st_ino ovl: fix lookup failure on multi lower squashfs lib: raid6: fix awk build warnings rtlwifi: rtl8192de: Fix missing enable interrupt flag rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address btrfs: record all roots for rename exchange on a subvol Btrfs: send, skip backreference walking for extents with many references btrfs: Remove btrfs_bio::flags member btrfs: Avoid getting stuck during cyclic writebacks Btrfs: fix negative subv_writers counter and data space leak after buffered write Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc btrfs: use refcount_inc_not_zero in kill_all_nodes btrfs: use btrfs_block_group_cache_done in update_block_group btrfs: check page->mapping when loading free space cache iwlwifi: pcie: fix support for transmitting SKBs with fraglist usb: typec: fix use after free in typec_register_port() phy: renesas: rcar-gen3-usb2: Fix sysfs interface of "role" usb: dwc3: ep0: Clear started flag on completion usb: dwc3: gadget: Clear started flag for non-IOC usb: dwc3: gadget: Fix logical condition usb: dwc3: pci: add ID for the Intel Comet Lake -H variant virtio-balloon: fix managed page counts when migrating pages between zones virt_wifi: fix use-after-free in virt_wifi_newlink() mtd: rawnand: Change calculating of position page containing BBM mtd: spear_smi: Fix Write Burst mode brcmfmac: disable PCIe interrupts before bus reset EDAC/altera: Use fast register IO for S10 IRQs tpm: Switch to platform_get_irq_optional() tpm: add check after commands attribs tab allocation usb: mon: Fix a deadlock in usbmon between mmap and read usb: core: urb: fix URB structure initialization function USB: adutux: fix interface sanity check usb: roles: fix a potential use after free USB: serial: io_edgeport: fix epic endpoint lookup USB: idmouse: fix interface sanity checks USB: atm: ueagle-atm: add missing endpoint check iio: adc: ad7124: Enable internal reference iio: adc: ad7606: fix reading unnecessary data from device iio: imu: inv_mpu6050: fix temperature reporting using bad unit iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting iio: adis16480: Fix scales factors iio: imu: st_lsm6dsx: fix ODR check in st_lsm6dsx_write_raw iio: adis16480: Add debugfs_reg_access entry ARM: dts: pandora-common: define wl1251 as child node of mmc3 usb: common: usb-conn-gpio: Don't log an error on probe deferral interconnect: qcom: qcs404: Walk the list safely on node removal interconnect: qcom: sdm845: Walk the list safely on node removal xhci: make sure interrupts are restored to correct state xhci: handle some XHCI_TRUST_TX_LENGTH quirks cases as default behaviour. xhci: Increase STS_HALT timeout in xhci_suspend() xhci: fix USB3 device initiated resume race with roothub autosuspend xhci: Fix memory leak in xhci_add_in_port() usb: xhci: only set D3hot for pci device staging: gigaset: add endpoint-type sanity check staging: gigaset: fix illegal free on probe errors staging: gigaset: fix general protection fault on probe staging: vchiq: call unregister_chrdev_region() when driver registration fails staging: rtl8712: fix interface sanity check staging: rtl8188eu: fix interface sanity check staging: exfat: fix multiple definition error of `rename_file' binder: fix incorrect calculation for num_valid usb: host: xhci-tegra: Correct phy enable sequence usb: Allow USB device to be warm reset in suspended state USB: documentation: flags on usb-storage versus UAS USB: uas: heed CAPACITY_HEURISTICS USB: uas: honor flag to avoid CAPACITY16 media: venus: remove invalid compat_ioctl32 handler ceph: fix compat_ioctl for ceph_dir_operations compat_ioctl: add compat_ptr_ioctl() scsi: qla2xxx: Fix memory leak when sending I/O fails scsi: qla2xxx: Fix double scsi_done for abort path scsi: qla2xxx: Fix driver unload hang scsi: qla2xxx: Do command completion on abort timeout scsi: zfcp: trace channel log even for FCP command responses scsi: lpfc: Fix bad ndlp ptr in xri aborted handling Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T" nvme: Namepace identification descriptor list is optional usb: gadget: pch_udc: fix use after free usb: gadget: configfs: Fix missing spin_lock_init() BACKPORT: FROMLIST: scsi: ufs: Export query request interfaces ANDROID: update abi with unbindable_ports sysctl BACKPORT: FROMLIST: net: introduce ip_local_unbindable_ports sysctl ANDROID: update abi for 5.4.3 merge ANDROID: update abi_gki_aarch64.xml for ion, drm changes ANDROID: drivers: gpu: drm: export drm_mode_convert_umode symbol ANDROID: ion: flush cache before exporting non-cached buffers Linux 5.4.3 kselftest: Fix NULL INSTALL_PATH for TARGETS runlist perf script: Fix invalid LBR/binary mismatch error EDAC/ghes: Fix locking and memory barrier issues watchdog: aspeed: Fix clock behaviour for ast2600 drm/mcde: Fix an error handling path in 'mcde_probe()' md/raid0: Fix an error message in raid0_make_request() cpufreq: imx-cpufreq-dt: Correct i.MX8MN's default speed grade value ALSA: hda - Fix pending unsol events at shutdown KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332) binder: Handle start==NULL in binder_update_page_range() binder: Prevent repeated use of ->mmap() via NULL mapping binder: Fix race between mmap() and binder_alloc_print_pages() Revert "serial/8250: Add support for NI-Serial PXI/PXIe+485 devices" vcs: prevent write access to vcsu devices thermal: Fix deadlock in thermal thermal_zone_device_check iomap: Fix pipe page leakage during splicing bdev: Refresh bdev size for disks without partitioning bdev: Factor out bdev revalidation into a common helper rfkill: allocate static minor RDMA/qib: Validate ->show()/store() callbacks before calling them can: ucan: fix non-atomic allocation in completion handler spi: Fix NULL pointer when setting SPI_CS_HIGH for GPIO CS spi: Fix SPI_CS_HIGH setting when using native and GPIO CS spi: atmel: Fix CS high support spi: stm32-qspi: Fix kernel oops when unbinding driver spi: spi-fsl-qspi: Clear TDH bits in FLSHCR register crypto: user - fix memory leak in crypto_reportstat crypto: user - fix memory leak in crypto_report crypto: ecdh - fix big endian bug in ECC library crypto: ccp - fix uninitialized list head crypto: geode-aes - switch to skcipher for cbc(aes) fallback crypto: af_alg - cast ki_complete ternary op to int crypto: atmel-aes - Fix IV handling when req->nbytes < ivsize crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr KVM: x86: Grab KVM's srcu lock when setting nested state KVM: x86: Remove a spurious export of a static function KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES KVM: x86: do not modify masked bits of shared MSRs KVM: arm/arm64: vgic: Don't rely on the wrong pending table KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one arm64: dts: exynos: Revert "Remove unneeded address space mapping for soc node" arm64: Validate tagged addresses in access_ok() called from kernel threads drm/i810: Prevent underflow in ioctl drm: damage_helper: Fix race checking plane->state->fb drm/msm: fix memleak on release jbd2: Fix possible overflow in jbd2_log_space_left() kernfs: fix ino wrap-around detection nfsd: restore NFSv3 ACL support nfsd: Ensure CLONE persists data and metadata changes to the target file can: slcan: Fix use-after-free Read in slcan_open tty: vt: keyboard: reject invalid keycodes CIFS: Fix SMB2 oplock break processing CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all() media: rc: mark input device as pointing stick Input: Fix memory leak in psxpad_spi_probe coresight: etm4x: Fix input validation for sysfs. Input: goodix - add upside-down quirk for Teclast X89 tablet Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash Input: synaptics - switch another X1 Carbon 6 to RMI/SMbus soc: mediatek: cmdq: fixup wrong input order of write api ALSA: hda: Modify stream stripe mask only when needed ALSA: hda - Add mute led support for HP ProBook 645 G4 ALSA: pcm: oss: Avoid potential buffer overflows ALSA: hda/realtek - Fix inverted bass GPIO pin on Acer 8951G ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236 ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop ALSA: hda/realtek - Enable internal speaker of ASUS UX431FLC SUNRPC: Avoid RPC delays when exiting suspend io_uring: ensure req->submit is copied when req is deferred io_uring: fix missing kmap() declaration on powerpc fuse: verify attributes fuse: verify write return fuse: verify nlink fuse: fix leak of fuse_io_priv io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR io_uring: fix dead-hung for non-iter fixed rw mwifiex: Re-work support for SDIO HW reset serial: ifx6x60: add missed pm_runtime_disable serial: 8250_dw: Avoid double error messaging when IRQ absent serial: stm32: fix clearing interrupt error flags serial: serial_core: Perform NULL checks for break_ctl ops serial: pl011: Fix DMA ->flush_buffer() tty: serial: msm_serial: Fix flow control tty: serial: fsl_lpuart: use the sg count from dma_map_sg serial: 8250-mtk: Use platform_get_irq_optional() for optional irq usb: gadget: u_serial: add missing port entry locking staging/octeon: Use stubs for MIPS && !CAVIUM_OCTEON_SOC mailbox: tegra: Fix superfluous IRQ error message time: Zero the upper 32-bits in __kernel_timespec on 32-bit lp: fix sparc64 LPSETTIMEOUT ioctl sparc64: implement ioremap_uc perf scripts python: exported-sql-viewer.py: Fix use of TRUE with SQLite arm64: tegra: Fix 'active-low' warning for Jetson Xavier regulator arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator rsi: release skb if rsi_prepare_beacon fails FROMLIST: scsi: ufs: Fix ufshcd_hold() caused scheduling while atomic FROMLIST: scsi: ufs: Add dev ref clock gating wait time support FROMLIST: scsi: ufs-qcom: Adjust bus bandwidth voting and unvoting FROMLIST: scsi: ufs: Remove the check before call setup clock notify vops FROMLIST: scsi: ufs: set load before setting voltage in regulators FROMLIST: scsi: ufs: Flush exception event before suspend FROMLIST: scsi: ufs: Do not rely on prefetched data FROMLIST: scsi: ufs: Fix up clock scaling FROMGIT: scsi: ufs: Do not free irq in suspend FROMGIT: scsi: ufs: Do not clear the DL layer timers FROMGIT: scsi: ufs: Release clock if DMA map fails FROMGIT: scsi: ufs: Use DBD setting in mode sense FROMGIT: scsi: core: Adjust DBD setting in MODE SENSE for caching mode page per LLD FROMGIT: scsi: ufs: Complete pending requests in host reset and restore path FROMGIT: scsi: ufs: Avoid messing up the compl_time_stamp of lrbs FROMGIT: scsi: ufs: Update VCCQ2 and VCCQ min/max voltage hard codes FROMGIT: scsi: ufs: Recheck bkops level if bkops is disabled ANDROID: update abi_gki_aarch64.xml for LTO, CFI, and SCS ANDROID: gki_defconfig: enable LTO, CFI, and SCS ANDROID: update abi_gki_aarch64.xml for CONFIG_GNSS ANDROID: cuttlefish_defconfig: Enable CONFIG_GNSS ANDROID: gki_defconfig: enable HID configs UPSTREAM: arm64: Validate tagged addresses in access_ok() called from kernel threads ANDROID: kbuild: limit LTO inlining ANDROID: kbuild: merge module sections with LTO ANDROID: f2fs: fix possible merge of unencrypted with encrypted I/O ANDROID: gki_defconfig: Enable UCLAMP by default ANDROID: make sure proc mount options are applied ANDROID: sound: usb: Add helper APIs to enable audio stream ANDROID: Update ABI representation ANDROID: Don't base allmodconfig on gki_defconfig ANDROID: Disable UNWINDER_ORC for allmodconfig ANDROID: ASoC: Fix 'allmodconfig' build break Linux 5.4.2 platform/x86: hp-wmi: Fix ACPI errors caused by passing 0 as input size platform/x86: hp-wmi: Fix ACPI errors caused by too small buffer HID: core: check whether Usage Page item is after Usage ID items crypto: talitos - Fix build error by selecting LIB_DES Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()" ext4: add more paranoia checking in ext4_expand_extra_isize handling r8169: fix resume on cable plug-in r8169: fix jumbo configuration for RTL8168evl selftests: pmtu: use -oneline for ip route list cache tipc: fix link name length check selftests: bpf: correct perror strings selftests: bpf: test_sockmap: handle file creation failures gracefully net/tls: use sg_next() to walk sg entries net/tls: remove the dead inplace_crypto code selftests/tls: add a test for fragmented messages net: skmsg: fix TLS 1.3 crash with full sk_msg net/tls: free the record on encryption error net/tls: take into account that bpf_exec_tx_verdict() may free the record openvswitch: remove another BUG_ON() openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info() sctp: cache netns in sctp_ep_common slip: Fix use-after-free Read in slip_open sctp: Fix memory leak in sctp_sf_do_5_2_4_dupcook openvswitch: fix flow command message size net: sched: fix `tc -s class show` no bstats on class with nolock subqueues net: psample: fix skb_over_panic net: macb: add missed tasklet_kill net: dsa: sja1105: fix sja1105_parse_rgmii_delays() mdio_bus: don't use managed reset-controller macvlan: schedule bc_work even if error gve: Fix the queue page list allocated pages count x86/fpu: Don't cache access to fpu_fpregs_owner_ctx thunderbolt: Power cycle the router if NVM authentication fails mei: me: add comet point V device id mei: bus: prefix device names on bus with the bus name USB: serial: ftdi_sio: add device IDs for U-Blox C099-F9P staging: rtl8723bs: Add 024c:0525 to the list of SDIO device-ids staging: rtl8723bs: Drop ACPI device ids staging: rtl8192e: fix potential use after free staging: wilc1000: fix illegal memory access in wilc_parse_join_bss_param() usb: dwc2: use a longer core rest timeout in dwc2_core_reset() driver core: platform: use the correct callback type for bus_find_device crypto: inside-secure - Fix stability issue with Macchiatobin net: disallow ancillary data for __sys_{send,recv}msg_file() net: separate out the msghdr copy from ___sys_{send,recv}msg() io_uring: async workers should inherit the user creds ANDROID: Update ABI representation UPSTREAM: of: property: Add device link support for interrupt-parent, dmas and -gpio(s) UPSTREAM: of: property: Fix the semantics of of_is_ancestor_of() UPSTREAM: i2c: of: Populate fwnode in of_i2c_get_board_info() UPSTREAM: regulator: core: Don't try to remove device links if add failed UPSTREAM: driver core: Clarify documentation for fwnode_operations.add_links() ANDROID: Update ABI representation ANDROID: gki_defconfig: IIO=y ANDROID: Update ABI representation ANDROID: ASoC: core - add hostless DAI support ANDROID: gki_defconfig: =m's applied for virtio configs in arm64 ANDROID: Update ABI representation after 5.4.1 merge Linux 5.4.1 KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel powerpc/book3s64: Fix link stack flush on context switch staging: comedi: usbduxfast: usbduxfast_ai_cmdtest rounding error USB: serial: option: add support for Foxconn T77W968 LTE modules USB: serial: option: add support for DW5821e with eSIM support USB: serial: mos7840: fix remote wakeup USB: serial: mos7720: fix remote wakeup USB: serial: mos7840: add USB ID to support Moxa UPort 2210 appledisplay: fix error handling in the scheduled work USB: chaoskey: fix error case of a timeout usb-serial: cp201x: support Mark-10 digital force gauge usbip: Fix uninitialized symbol 'nents' in stub_recv_cmd_submit() usbip: tools: fix fd leakage in the function of read_attr_usbip_status USBIP: add config dependency for SGL_ALLOC ALSA: hda - Disable audio component for legacy Nvidia HDMI codecs media: mceusb: fix out of bounds read in MCE receiver buffer media: imon: invalid dereference in imon_touch_event media: cxusb: detect cxusb_ctrl_msg error in query media: b2c2-flexcop-usb: add sanity checking media: uvcvideo: Fix error path in control parsing failure futex: Prevent exit livelock futex: Provide distinct return value when owner is exiting futex: Add mutex around futex exit futex: Provide state handling for exec() as well futex: Sanitize exit state handling futex: Mark the begin of futex exit explicitly futex: Set task::futex_state to DEAD right after handling futex exit futex: Split futex_mm_release() for exit/exec exit/exec: Seperate mm_release() futex: Replace PF_EXITPIDONE with a state futex: Move futex exit handling into futex code cpufreq: Add NULL checks to show() and store() methods of cpufreq media: usbvision: Fix races among open, close, and disconnect media: usbvision: Fix invalid accesses after device disconnect media: vivid: Fix wrong locking that causes race conditions on streaming stop media: vivid: Set vid_cap_streaming and vid_out_streaming to true ALSA: usb-audio: Fix Scarlett 6i6 Gen 2 port data ALSA: usb-audio: Fix NULL dereference at parsing BADD futex: Prevent robust futex exit race x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3 x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise selftests/x86/sigreturn/32: Invalidate DS and ES when abusing the kernel selftests/x86/mov_ss_trap: Fix the SYSENTER test x86/entry/32: Fix NMI vs ESPFIX x86/entry/32: Unwind the ESPFIX stack earlier on exception entry x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALL x86/entry/32: Use %ss segment where required x86/entry/32: Fix IRET exception x86/cpu_entry_area: Add guard page for entry stack on 32bit x86/pti/32: Size initial_page_table correctly x86/doublefault/32: Fix stack canaries in the double fault handler x86/xen/32: Simplify ring check in xen_iret_crit_fixup() x86/xen/32: Make xen_iret_crit_fixup() independent of frame layout x86/stackframe/32: Repair 32-bit Xen PV nbd: prevent memory leak x86/speculation: Fix redundant MDS mitigation message x86/speculation: Fix incorrect MDS/TAA mitigation status x86/insn: Fix awk regexp warnings md/raid10: prevent access of uninitialized resync_pages offset Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues" Revert "Bluetooth: hci_ll: set operational frequency earlier" ath10k: restore QCA9880-AR1A (v1) detection ath10k: Fix HOST capability QMI incompatibility ath10k: Fix a NULL-ptr-deref bug in ath10k_usb_alloc_urb_from_pipe ath9k_hw: fix uninitialized variable data Bluetooth: Fix invalid-free in bcsp_close() ANDROID: gki_defconfig: enable CONFIG_REGULATOR_FIXED_VOLTAGE FROMLIST: crypto: arm64/sha: fix function types ANDROID: arm64: kvm: disable CFI ANDROID: arm64: add __nocfi to __apply_alternatives ANDROID: arm64: add __pa_function ANDROID: arm64: add __nocfi to functions that jump to a physical address ANDROID: arm64: bpf: implement arch_bpf_jit_check_func ANDROID: bpf: validate bpf_func when BPF_JIT is enabled with CFI ANDROID: add support for Clang's Control Flow Integrity (CFI) ANDROID: arm64: allow LTO_CLANG and THINLTO to be selected FROMLIST: arm64: fix alternatives with LLVM's integrated assembler FROMLIST: arm64: lse: fix LSE atomics with LLVM's integrated assembler ANDROID: arm64: disable HAVE_ARCH_PREL32_RELOCATIONS with LTO_CLANG ANDROID: arm64: vdso: disable LTO ANDROID: irqchip/gic-v3: rename gic_of_init to work around a ThinLTO+CFI bug ANDROID: soc/tegra: disable ARCH_TEGRA_210_SOC with LTO ANDROID: init: ensure initcall ordering with LTO ANDROID: drivers/misc/lkdtm: disable LTO for rodata.o ANDROID: efi/libstub: disable LTO ANDROID: scripts/mod: disable LTO for empty.c ANDROID: kbuild: fix dynamic ftrace with clang LTO ANDROID: kbuild: add support for Clang LTO ANDROID: kbuild: add CONFIG_LD_IS_LLD FROMGIT: driver core: platform: use the correct callback type for bus_find_device FROMLIST: arm64: implement Shadow Call Stack FROMLIST: arm64: disable SCS for hypervisor code FROMLIST: arm64: vdso: disable Shadow Call Stack FROMLIST: arm64: efi: restore x18 if it was corrupted FROMLIST: arm64: preserve x18 when CPU is suspended FROMLIST: arm64: reserve x18 from general allocation with SCS FROMLIST: arm64: disable function graph tracing with SCS FROMLIST: scs: add support for stack usage debugging FROMLIST: scs: add accounting FROMLIST: add support for Clang's Shadow Call Stack (SCS) FROMLIST: arm64: kernel: avoid x18 in __cpu_soft_restart FROMLIST: arm64: kvm: stop treating register x18 as caller save FROMLIST: arm64/lib: copy_page: avoid x18 register in assembler code FROMLIST: arm64: mm: avoid x18 in idmap_kpti_install_ng_mappings ANDROID: clang: update to 10.0.1 ANDROID: update ABI representation Conflicts: Documentation/devicetree/bindings Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt arch/arm64/Kconfig drivers/firmware/qcom_scm-64.c drivers/hwtracing/coresight/coresight.c drivers/scsi/ufs/ufs.h drivers/scsi/ufs/ufshcd.c drivers/scsi/ufs/ufshcd.h drivers/scsi/ufs/unipro.h drivers/staging/android/ion/heaps/ion_cma_heap.c drivers/staging/android/ion/heaps/ion_system_heap.c drivers/usb/dwc3/ep0.c drivers/usb/dwc3/gadget.c include/sound/pcm.h include/sound/soc.h kernel/exit.c kernel/sched/core.c Change-Id: I66ea973ddcafd352ba999a1dc98e04df33397e3b Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>	2020-01-23 04:00:53 -08:00
Vinayak Menon	75258ab7ca	mm: vmstat: add pageoutclean vmstat events currently count pgpgout, but that includes only the writebacks, and not the reclaim of clean pages. Add an event to count clean page evictions. This is helpful to evaluate page thrashing cases. Change-Id: Icfb797877a544a58c289074bdc290dfbc1384514 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [isaacm@codeaurora.org: Add Kconfig entry to prevent ABI breakages in GKI] Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>	2019-12-17 10:54:01 -08:00
Satya Durga Srinivasu Prabhala	201ea48219	kernel: Add snapshot of changes to support cpu isolation This snapshot is taken from msm-4.19 as of commit 5debecbe7195 ("trace: filter out spurious preemption and IRQs disable traces"). Change-Id: I222aa448ac68f7365065f62dba9db94925da38a0 Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>	2019-12-10 12:42:32 -08:00
Sami Tolvanen	7f498a4b7b	FROMLIST: scs: add accounting This change adds accounting for the memory allocated for shadow stacks. Bug: 145210207 Change-Id: Iee94c22abefcabb63a3bcd4db8ba952130f30a82 (am from https://lore.kernel.org/patchwork/patch/1149055/) Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2019-11-27 12:49:09 -08:00
Michal Hocko	93b3a67448	mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo pagetypeinfo_showfree_print is called by zone->lock held in irq mode. This is not really nice because it blocks both any interrupts on that cpu and the page allocator. On large machines this might even trigger the hard lockup detector. Considering the pagetypeinfo is a debugging tool we do not really need exact numbers here. The primary reason to look at the outuput is to see how pageblocks are spread among different migratetypes and low number of pages is much more interesting therefore putting a bound on the number of pages on the free_list sounds like a reasonable tradeoff. The new output will simply tell [...] Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648 instead of Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648 The limit has been chosen arbitrary and it is a subject of a future change should there be a need for that. While we are at it, also drop the zone lock after each free_list iteration which will help with the IRQ and page allocator responsiveness even further as the IRQ lock held time is always bound to those 100k pages. [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand] Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Mel Gorman <mgorman@suse.de> Cc: Roman Gushchin <guro@fb.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-06 08:47:50 -08:00
Michal Hocko	abaed0112c	mm, vmstat: hide /proc/pagetypeinfo from normal users /proc/pagetypeinfo is a debugging tool to examine internal page allocator state wrt to fragmentation. It is not very useful for any other use so normal users really do not need to read this file. Waiman Long has noticed that reading this file can have negative side effects because zone->lock is necessary for gathering data and that a) interferes with the page allocator and its users and b) can lead to hard lockups on large machines which have very long free_list. Reduce both issues by simply not exporting the file to regular users. Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org Fixes: `467c996c1e` ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Waiman Long <longman@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Jann Horn <jannh@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-06 08:47:50 -08:00
Song Liu	60fbf0ab5d	mm,thp: stats for file backed THP In preparation for non-shmem THP, this patch adds a few stats and exposes them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and /proc/<pid>/task/<tid>/smaps. This patch is mostly a rewrite of Kirill A. Shutemov's earlier version: https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/ Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Thomas Gleixner	457c899653	treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-21 10:50:45 +02:00
Konstantin Khlebnikov	e8277b3b52	mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n Commit `58bc4c34d2` ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly") depends on skipping vmstat entries with empty name introduced in `7aaf772723` ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat") but reverted in `b29940c1ab` ("mm: rename and change semantics of nr_indirectly_reclaimable_bytes"). So skipping no longer works and /proc/vmstat has misformatted lines " 0". This patch simply shows debug counters "nr_tlb_remote_" for UP. Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz Fixes: `58bc4c34d2` ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH properly") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <guro@fb.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-04-19 09:46:04 -07:00
Greg Kroah-Hartman	d9f7979c92	mm: no need to check return value of debugfs_create functions When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-03-05 21:07:17 -08:00
Arun KS	9705bea5f8	mm: convert zone->managed_pages to atomic variable totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-12-28 12:11:47 -08:00
Janne Huttunen	13c9aaf7fa	mm/vmstat.c: fix NUMA statistics updates Scan through the whole array to see if an update is needed. While we're at it, use sizeof() to be safe against any possible type changes in the future. The bug here is that we wouldn't sync per-cpu counters into global ones if there was an update of numa_stats for higher cpus. Highly theoretical one though because it is much more probable that zone_stats are updated so we would refresh anyway. So I wouldn't bother to mark this for stable, yet something nice to fix. [mhocko@suse.com: changelog enhancement] Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com Fixes: `1d90ca897c` ("mm: update NUMA counter threshold size") Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-11-18 10:15:10 -08:00
Jann Horn	f0ecf25a09	mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size Having two gigantic arrays that must manually be kept in sync, including ifdefs, isn't exactly robust. To make it easier to catch such issues in the future, add a BUILD_BUG_ON(). Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:35 -07:00
Johannes Weiner	68d48e6a2d	mm: workingset: add vmstat counter for shadow nodes Make it easier to catch bugs in the shadow node shrinker by adding a counter for the shadow nodes in circulation. [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()] [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes] Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:33 -07:00
Johannes Weiner	1899ad18c6	mm: workingset: tell cache transitions from workingset thrashing Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:32 -07:00
Vlastimil Babka	b29940c1ab	mm: rename and change semantics of nr_indirectly_reclaimable_bytes The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by commit `eb59254608` ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with the goal of accounting objects that can be reclaimed, but cannot be allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user is converted. The counter is however still useful for accounting direct page allocations (i.e. not slab) with a shrinker, such as the ION page pool. So keep it, and: - change granularity to pages to be more like other counters; sub-page allocations should be able to use kmalloc - rename the counter to NR_KERNEL_MISC_RECLAIMABLE - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can again remove the check for not printing "hidden" counters Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:32 -07:00
Jann Horn	58bc4c34d2	mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly `5dd0b16cda` ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside the kernel unconditional to reduce #ifdef soup, but (either to avoid showing dummy zero counters to userspace, or because that code was missed) didn't update the vmstat_array, meaning that all following counters would be shown with incorrect values. This only affects kernel builds with CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n. Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com Fixes: `5dd0b16cda` ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-10-05 16:32:05 -07:00
Jann Horn	28e2c4bb99	mm/vmstat.c: fix outdated vmstat_text `7a9cdebdcc` ("mm: get rid of vmacache_flush_all() entirely") removed the VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding entry in vmstat_text. This causes an out-of-bounds access in vmstat_show(). Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which is probably very rare. Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com Fixes: `7a9cdebdcc` ("mm: get rid of vmacache_flush_all() entirely") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-10-05 16:32:05 -07:00
Sebastian Andrzej Siewior	28557cc106	Revert mm/vmstat.c: fix vmstat_update() preemption BUG Revert commit `c7f26ccfb2` ("mm/vmstat.c: fix vmstat_update() preemption BUG"). Steven saw a "using smp_processor_id() in preemptible" message and added a preempt_disable() section around it to keep it quiet. This is not the right thing to do it does not fix the real problem. vmstat_update() is invoked by a kworker on a specific CPU. This worker it bound to this CPU. The name of the worker was "kworker/1:1" so it should have been a worker which was bound to CPU1. A worker which can run on any CPU would have a `u' before the first digit. smp_processor_id() can be used in a preempt-enabled region as long as the task is bound to a single CPU which is the case here. If it could run on an arbitrary CPU then this is the problem we have an should seek to resolve. Not only this smp_processor_id() must not be migrated to another CPU but also refresh_cpu_vm_stats() which might access wrong per-CPU variables. Not to mention that other code relies on the fact that such a worker runs on one specific CPU only. Therefore revert that commit and we should look instead what broke the affinity mask of the kworker. Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven J. Hill <steven.hill@cavium.com> Cc: Tejun Heo <htejun@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-28 11:16:44 -07:00
Christoph Hellwig	fddda2b7b5	proc: introduce proc_create_seq{,_data} Variants of proc_create{,_data} that directly take a struct seq_operations argument and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-16 07:23:35 +02:00
Roman Gushchin	7aaf772723	mm: don't show nr_indirectly_reclaimable in /proc/vmstat Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is no need to export this vm counter to userspace, and some changes are expected in reclaimable object accounting, which can alter this counter. Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-05-11 17:28:45 -07:00
Roman Gushchin	eb59254608	mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES Patch series "indirectly reclaimable memory", v2. This patchset introduces the concept of indirectly reclaimable memory and applies it to fix the issue of when a big number of dentries with external names can significantly affect the MemAvailable value. This patch (of 3): Introduce a concept of indirectly reclaimable memory and adds the corresponding memory counter and /proc/vmstat item. Indirectly reclaimable memory is any sort of memory, used by the kernel (except of reclaimable slabs), which is actually reclaimable, i.e. will be released under memory pressure. The counter is in bytes, as it's not always possible to count such objects in pages. The name contains BYTES by analogy to NR_KERNEL_STACK_KB. Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-11 10:28:29 -07:00
Steven J. Hill	c7f26ccfb2	mm/vmstat.c: fix vmstat_update() preemption BUG Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can cause vmstat_update() to report a BUG due to preemption not being disabled around smp_processor_id(). Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor. BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:1/269 caller is vmstat_update+0x50/0xa0 CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted 4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1 Workqueue: mm_percpu_wq vmstat_update Call Trace: show_stack+0x94/0x128 dump_stack+0xa4/0xe0 check_preemption_disabled+0x118/0x120 vmstat_update+0x50/0xa0 process_one_work+0x144/0x348 worker_thread+0x150/0x4b8 kthread+0x110/0x140 ret_from_kernel_thread+0x14/0x1c Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com Signed-off-by: Steven J. Hill <steven.hill@cavium.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-03-28 13:42:05 -10:00
Kemi Wang	4518085e12	mm, sysctl: make NUMA stats configurable This is the second step which introduces a tunable interface that allow numa stats configurable for optimizing zone_statistics(), as suggested by Dave Hansen and Ying Huang. ========================================================================= When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased numa counter precision, you can do: echo 0 > /proc/sys/vm/numa_stat In this case, numa counter update is ignored. We can see about 4.8%(185->176) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench01 (single thread) and 8.1%(343->315) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench03 (88 threads) running on a 2-Socket Broadwell-based server (88 threads, 126G memory). Benchmark link provided by Jesper D Brouer (increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench ========================================================================= When page allocation performance is not a bottleneck and you want all tooling to work, you can do: echo 1 > /proc/sys/vm/numa_stat This is system default setting. Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka for comments to help improve the original patch. [keescook@chromium.org: make sure mutex is a global static] Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:07 -08:00
Andrey Ryabinin	3a50d14d0d	mm: remove unused pgdat->inactive_ratio Since commit `59dc76b0d4` ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:03 -08:00
Kemi Wang	638032224e	mm: consider the number in local CPUs when reading NUMA stats To avoid deviation, the per cpu number of NUMA stats in vm_numa_stat_diff[] is included when a user reads the NUMA stats. Since NUMA stats does not be read by users frequently, and kernel does not need it to make a decision, it will not be a problem to make the readers more expensive. Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Ying Huang <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Kemi Wang	1d90ca897c	mm: update NUMA counter threshold size There is significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (suggested by Dave Hansen). This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter(suggested by Ying Huang). The rationality is that these statistics counters don't affect the kernel's decision, unlike other VM counters, so it's not a problem to use a large threshold. With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default (base) 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Kemi Wang	3a321d2a3d	mm: change the call sites of numa statistics items Patch series "Separate NUMA statistics from zone statistics", v2. Each page allocation updates a set of per-zone statistics with a call to zone_statistics(). As discussed in 2017 MM summit, these are a substantial source of overhead in the page allocator and are very rarely consumed. This significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (pointed out by Dave Hansen). A link to the MM summit slides: http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf To mitigate this overhead, this patchset separates NUMA statistics from zone statistics framework, and update NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter (suggested by Ying Huang). The rationality is that these statistics counters don't need to be read often, unlike other VM counters, so it's not a problem to use a large threshold and make readers more expensive. With this patchset, we see 31.3% drop of CPU cycles(537-->369, see below) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Meanwhile, this patchset keeps the same style of virtual memory statistics with little end-user-visible effects (only move the numa stats to show behind zone page stats, see the first patch for details). I did an experiment of single page allocation and reclaim concurrently using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88 processors with 126G memory) with different size of threshold of pcp counter. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics This patch (of 3): In this patch, NUMA statistics is separated from zone statistics framework, all the call sites of NUMA stats are changed to use numa-stats-specific functions, it does not have any functionality change except that the number of NUMA stats is shown behind zone page stats when users read the zone info. E.g. cat /proc/zoneinfo *Base* *With this patch* nr_free_pages 3976 nr_free_pages 3976 nr_zone_inactive_anon 0 nr_zone_inactive_anon 0 nr_zone_active_anon 0 nr_zone_active_anon 0 nr_zone_inactive_file 0 nr_zone_inactive_file 0 nr_zone_active_file 0 nr_zone_active_file 0 nr_zone_unevictable 0 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_zone_write_pending 0 nr_mlock 0 nr_mlock 0 nr_page_table_pages 0 nr_page_table_pages 0 nr_kernel_stack 0 nr_kernel_stack 0 nr_bounce 0 nr_bounce 0 nr_zspages 0 nr_zspages 0 numa_hit 0 nr_free_cma 0 numa_miss 0 numa_hit 0 numa_foreign 0 numa_miss 0 numa_interleave 0 numa_foreign 0 numa_local 0 numa_interleave 0 numa_other 0 numa_local 0 nr_free_cma 0 numa_other 0 ... ... vm stats threshold: 10 vm stats threshold: 10 ... ... The next patch updates the numa stats counter size and threshold. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Huang Ying	cbc65df240	mm, swap: add swap readahead hit statistics Patch series "mm, swap: VMA based swap readahead", v4. The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory space. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patchset, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. In addition to the swap readahead changes, some new sysfs interface is added to show the efficiency of the readahead algorithm and some other swap statistics. This new implementation will incur more small random read, on SSD, the improved correctness of estimation and readahead target should beat the potential increased overhead, this is also illustrated in the test results below. But on HDD, the overhead may beat the benefit, so the original implementation will be used by default. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. This patch (of 5): The statistics for total readahead pages and total readahead hits are recorded and exported via the following sysfs interface. /sys/kernel/mm/swap/ra_hits /sys/kernel/mm/swap/ra_total With them, the efficiency of the swap readahead could be measured, so that the swap readahead algorithm and parameters could be tuned accordingly. [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n] Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:29 -07:00
SeongJae Park	f113e64121	mm/vmstat.c: fix wrong comment Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from pagetypeinfo_show_free()'s comment. This commit fixes it. Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com Fixes: `467c996c1e` ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: SeongJae Park <sj38.park@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:29 -07:00
Wen Yang	88d6ac40c1	mm/vmstat: fix divide error at __fragmentation_index When order is -1 or too big, 1UL << order will be 0, which will cause a divide error. Although it seems that all callers of __fragmentation_index() will only do so with a valid order, the patch can make it more robust. Should prevent reoccurrences of https://bugzilla.kernel.org/show_bug.cgi?id=196555 Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:29 -07:00
Michal Hocko	c41f012ade	mm: rename global_page_state to global_zone_page_state global_page_state is error prone as a recent bug report pointed out [1]. It only returns proper values for zone based counters as the enum it gets suggests. We already have global_node_page_state so let's rename global_page_state to global_zone_page_state to be more explicit here. All existing users seems to be correct: $ git grep "global_page_state(NR_" \| sed 's@.($NR_[A-Z_]$).*@\1@' \| sort \| uniq -c 2 NR_BOUNCE 2 NR_FREE_CMA_PAGES 11 NR_FREE_PAGES 1 NR_KERNEL_STACK_KB 1 NR_MLOCK 2 NR_PAGETABLE This patch shouldn't introduce any functional change. [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:29 -07:00
Huang Ying	fe490cc0fe	mm, THP, swap: add THP swapping out fallback counting When swapping out THP (Transparent Huge Page), instead of swapping out the THP as a whole, sometimes we have to fallback to split the THP into normal pages before swapping, because no free swap clusters are available, or cgroup limit is exceeded, etc. To count the number of the fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when we fallback to split the THP. Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:28 -07:00
Huang Ying	225311a464	mm: test code to write THP to swap device as a whole To support delay splitting THP (Transparent Huge Page) after swapped out, we need to enhance swap writing code to support to write a THP as a whole. This will improve swap write IO performance. As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on multipage bvec support, which hasn't been merged yet. So this patch is only for testing the functionality of the other patches in the series. And will be reimplemented after multipage bvec support is merged. Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <shli@kernel.org> Cc: Vishal L Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:28 -07:00
Vinayak Menon	727c080f03	mm: avoid taking zone lock in pagetypeinfo_showmixed() pagetypeinfo_showmixedcount_print is found to take a lot of time to complete and it does this holding the zone lock and disabling interrupts. In some cases it is found to take more than a second (On a 2.4GHz,8Gb RAM,arm64 cpu). Avoid taking the zone lock similar to what is done by read_page_owner, which means possibility of inaccurate results. Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-10 16:32:32 -07:00
Johannes Weiner	385386cff4	mm: vmstat: move slab statistics from zone to node counters Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Konstantin Khlebnikov	8e675f7af5	mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob "memory.events" (in memory.oom_control for v1 cgroup). Also describe difference between "oom" and "oom_kill" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Michal Hocko	d336e94e44	mm, vmstat: skip reporting offline pages in pagetypeinfo pagetypeinfo_showblockcount_print skips over invalid pfns but it would report pages which are offline because those have a valid pfn. Their migrate type is misleading at best. Now that we have pfn_to_online_page() we can use it instead of pfn_valid() and fix this. [mhocko@suse.com: fix build] Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Joonsoo Kim <js1304@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:32 -07:00
Anshuman Khandual	9d85e15f1d	mm/vmstat.c: standardize file operations variable names Standardize the file operation variable names related to all four memory management /proc interface files. Also change all the symbol permissions (S_IRUGO) into octal permissions (0444) as it got complaints from checkpatch.pl. This does not create any functional change to the interface. Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:31 -07:00
Reza Arbab	8d35bb3106	mm, vmstat: Remove spurious WARN() during zoneinfo print After commit `e2ecc8a79e` ("mm, vmstat: print non-populated zones in zoneinfo"), /proc/zoneinfo will show unpopulated zones. A memoryless node, having no populated zones at all, was previously ignored, but will now trigger the WARN() in is_zone_first_populated(). Remove this warning, as its only purpose was to warn of a situation that has since been enabled. Aside: The "per-node stats" are still printed under the first populated zone, but that's not necessarily the first stanza any more. I'm not sure which criteria is more important with regard to not breaking parsers, but it looks a little weird to the eye. Fixes: `e2ecc8a79e` ("mm, vmstat: print node-based stats in zoneinfo file") Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-12 15:57:15 -07:00
David Rientjes	7dfb8bf3b9	mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo After "mm, vmstat: print non-populated zones in zoneinfo", /proc/zoneinfo will show unpopulated zones. The per-cpu pageset statistics are not relevant for unpopulated zones and can be potentially lengthy, so supress them when they are not interesting. Also moves lowmem reserve protection information above pcp stats since it is relevant for all zones per vm.lowmem_reserve_ratio. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
David Rientjes	b2bd859819	mm, vmstat: print non-populated zones in zoneinfo Initscripts can use the information (protection levels) from /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot. vm.lowmem_reserve_ratio is an array of ratios for each configured zone on the system. If a zone is not populated on an arch, /proc/zoneinfo suppresses its output. This results in there not being a 1:1 mapping between the set of zones emitted by /proc/zoneinfo and the zones configured by vm.lowmem_reserve_ratio. This patch shows statistics for non-populated zones in /proc/zoneinfo. The zones exist and hold a spot in the vm.lowmem_reserve_ratio array. Without this patch, it is not possible to determine which index in the array controls which zone if one or more zones on the system are not populated. Remaining users of walk_zones_in_node() are unchanged. Files such as /proc/pagetypeinfo require certain zone data to be initialized properly for display, which is not done for unpopulated zones. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
Shaohua Li	f7ad2a6cb9	mm: move MADV_FREE pages into LRU_INACTIVE_FILE list madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still anonymous pages, but they can be freed without pageout. To distinguish these from normal anonymous pages, we clear their SwapBacked flag. MADV_FREE pages could be freed without pageout, so they pretty much like used once file pages. For such pages, we'd like to reclaim them once there is memory pressure. Also it might be unfair reclaiming MADV_FREE pages always before used once file pages and we definitively want to reclaim the pages before other anonymous and file pages. To speed up MADV_FREE pages reclaim, we put the pages into LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny nowadays and should be full of used once file pages. Reclaiming MADV_FREE pages will not have much interfere of anonymous and active file pages. And the inactive file pages and MADV_FREE pages will be reclaimed according to their age, so we don't reclaim too many MADV_FREE pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also means we can reclaim the pages without swap support. This idea is suggested by Johannes. This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to avoid bisect failure, next patch will do it. The patch is based on Minchan's original patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
Johannes Weiner	c822f6223d	mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() NR_PAGES_SCANNED counts number of pages scanned since the last page free event in the allocator. This was used primarily to measure the reclaimability of zones and nodes, and determine when reclaim should give up on them. In that role, it has been replaced in the preceding patches by a different mechanism. Being implemented as an efficient vmstat counter, it was automatically exported to userspace as well. It's however unlikely that anyone outside the kernel is using this counter in any meaningful way. Remove the counter and the unused pgdat_reclaimable(). Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jia He <hejianet@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
Johannes Weiner	c73322d098	mm: fix 100% CPU kswapd busyloop on unreclaimable nodes Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and cleanups". Jia reported a scenario in which the kswapd of a node indefinitely spins at 100% CPU usage. We have seen similar cases at Facebook. The kernel's current method of judging its ability to reclaim a node (or whether to back off and sleep) is based on the amount of scanned pages in proportion to the amount of reclaimable pages. In Jia's and our scenarios, there are no reclaimable pages in the node, however, and the condition for backing off is never met. Kswapd busyloops in an attempt to restore the watermarks while having nothing to work with. This series reworks the definition of an unreclaimable node based not on scanning but on whether kswapd is able to actually reclaim pages in MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria the page allocator uses for giving up on direct reclaim and invoking the OOM killer. If it cannot free any pages, kswapd will go to sleep and leave further attempts to direct reclaim invocations, which will either make progress and re-enable kswapd, or invoke the OOM killer. Patch #1 fixes the immediate problem Jia reported, the remainder are smaller fixlets, cleanups, and overall phasing out of the old method. Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(), and directly related to #5, but in itself not relevant to the series. If the whole series is too ambitious for 4.11, I would consider the first three patches fixes, the rest cleanups. This patch (of 9): Jia He reports a problem with kswapd spinning at 100% CPU when requesting more hugepages than memory available in the system: $ echo 4000 >/proc/sys/vm/nr_hugepages top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01 Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3 At that time, there are no reclaimable pages left in the node, but as kswapd fails to restore the high watermarks it refuses to go to sleep. Kswapd needs to back away from nodes that fail to balance. Up until commit `1d82de618d` ("mm, vmscan: make kswapd reclaim in terms of nodes") kswapd had such a mechanism. It considered zones whose theoretically reclaimable pages it had reclaimed six times over as unreclaimable and backed away from them. This guard was erroneously removed as the patch changed the definition of a balanced node. However, simply restoring this code wouldn't help in the case reported here: there are no reclaimable pages that could be scanned until the threshold is met. Kswapd would stay awake anyway. Introduce a new and much simpler way of backing off. If kswapd runs through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single page, make it back off from the node. This is the same number of shots direct reclaim takes before declaring OOM. Kswapd will go to sleep on that node until a direct reclaimer manages to reclaim some pages, thus proving the node reclaimable again. [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count] Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org [shakeelb@google.com: fix condition for throttle_direct_reclaim] Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Jia He <hejianet@gmail.com> Tested-by: Jia He <hejianet@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:07 -07:00
Michal Hocko	80d136e138	mm: make mm_percpu_wq non freezable Geert has reported a freeze during PM resume and some additional debugging has shown that the device_resume worker cannot make a forward progress because it waits for an event which is stuck waiting in drain_all_pages: INFO: task kworker/u4:0:5 blocked for more than 120 seconds. Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u4:0 D 0 5 2 0x00000000 Workqueue: events_unbound async_run_entry_fn __schedule schedule schedule_timeout wait_for_common dpm_wait_for_superior device_resume async_resume async_run_entry_fn process_one_work worker_thread kthread [...] bash D 0 1703 1694 0x00000000 __schedule schedule schedule_timeout wait_for_common flush_work drain_all_pages start_isolate_page_range alloc_contig_range cma_alloc __alloc_from_contiguous cma_allocator_alloc __dma_alloc arm_dma_alloc sh_eth_ring_init sh_eth_open sh_eth_resume dpm_run_callback device_resume dpm_resume dpm_resume_end suspend_devices_and_enter pm_suspend state_store kernfs_fop_write __vfs_write vfs_write SyS_write [...] Showing busy workqueues and worker pools: [...] workqueue mm_percpu_wq: flags=0xc pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0 delayed: drain_local_pages_wq, vmstat_update pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0 delayed: drain_local_pages_wq BAR(1703), vmstat_update Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE so it is frozen this early during resume so we are effectively deadlocked. Fix this by dropping WQ_FREEZABLE when creating mm_percpu_wq. We really want to have it operational all the time. Fixes: `ce612879dd` ("mm: move pcp and lru-pcp draining into single wq") Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-04-19 15:53:48 -07:00
Michal Hocko	ce612879dd	mm: move pcp and lru-pcp draining into single wq We currently have 2 specific WQ_RECLAIM workqueues in the mm code. vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain per cpu lru caches. This seems more than necessary because both can run on a single WQ. Both do not block on locks requiring a memory allocation nor perform any allocations themselves. We will save one rescuer thread this way. On the other hand drain_all_pages() queues work on the system wq which doesn't have rescuer and so this depend on memory allocation (when all workers are stuck allocating and new ones cannot be created). Initially we thought this would be more of a theoretical problem but Hugh Dickins has reported: : 4.11-rc has been giving me hangs after hours of swapping load. At : first they looked like memory leaks ("fork: Cannot allocate memory"); : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh" : before looking at /proc/meminfo one time, and the stat_refresh stuck : in D state, waiting for completion of flush_work like many kworkers. : kthreadd waiting for completion of flush_work in drain_all_pages(). This worker should be using WQ_RECLAIM as well in order to guarantee a forward progress. We can reuse the same one as for lru draining and vmstat. Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-04-08 00:47:49 -07:00
Michal Hocko	597b7305dd	mm: move mm_percpu_wq initialization earlier Yang Li has reported that drain_all_pages triggers a WARN_ON which means that this function is called earlier than the mm_percpu_wq is initialized on arm64 with CMA configured: WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127 Hardware name: Freescale Layerscape 2088A RDB Board (DT) task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000 PC is at drain_all_pages+0x244/0x25c LR is at start_isolate_page_range+0x14c/0x1f0 [...] drain_all_pages+0x244/0x25c start_isolate_page_range+0x14c/0x1f0 alloc_contig_range+0xec/0x354 cma_alloc+0x100/0x1fc dma_alloc_from_contiguous+0x3c/0x44 atomic_pool_init+0x7c/0x208 arm64_dma_init+0x44/0x4c do_one_initcall+0x38/0x128 kernel_init_freeable+0x1a0/0x240 kernel_init+0x10/0xfc ret_from_fork+0x10/0x20 Fix this by moving the whole setup_vmstat which is an initcall right now to init_mm_internals which will be called right after the WQ subsystem is initialized. Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Yang Li <pku.leo@gmail.com> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-03-31 17:13:30 -07:00
Yisheng Xie	ce9311cf95	mm/vmstats: add thp_split_pud event for clarity We added support for PUD-sized transparent hugepages, however we count the event "thp split pud" into thp_split_pmd event. To separate the event count of thp split pud from pmd, add a new event named thp_split_pud. Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-03-09 17:01:10 -08:00
David Rientjes	7f354a548d	mm, compaction: add vmstats for kcompactd work A "compact_daemon_wake" vmstat exists that represents the number of times kcompactd has woken up. This doesn't represent how much work it actually did, though. It's useful to understand how much compaction work is being done by kcompactd versus other methods such as direct compaction and explicitly triggered per-node (or system) compaction. This adds two new vmstats: "compact_daemon_migrate_scanned" and "compact_daemon_free_scanned" to represent the number of pages kcompactd has scanned as part of its migration scanner and freeing scanner, respectively. These values are still accounted for in the general "compact_migrate_scanned" and "compact_free_scanned" for compatibility. It could be argued that explicitly triggered compaction could also be tracked separately, and that could be added if others find it useful. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:29 -08:00

1 2 3 4 5 ...

272 Commits