Commit Graph

1535 Commits

Author SHA1 Message Date
Raghavendra Rao Ananta
c249464a77 Merge remote-tracking branch 'remotes/origin/tmp-aefd2d632eb0' into msm-lahaina
* remotes/origin/tmp-aefd2d632eb0:
  ANDROID: binder: fix sleeping from invalid function caused by RT inheritance
  FROMGIT: scsi: ufs: override auto suspend tunables for ufs
  FROMGIT: scsi: core: allow auto suspend override by low-level driver
  Linux 5.4-rc6
  net: fix installing orphaned programs
  net: cls_bpf: fix NULL deref on offload filter removal
  selftests: bpf: Skip write only files in debugfs
  selftests: net: reuseport_dualstack: fix uninitalized parameter
  r8169: fix wrong PHY ID issue with RTL8168dp
  net: dsa: bcm_sf2: Fix IMP setup for port different than 8
  net: phylink: Fix phylink_dbg() macro
  gve: Fixes DMA synchronization.
  inet: stop leaking jiffies on the wire
  ixgbe: Remove duplicate clear_bit() call
  Documentation: networking: device drivers: Remove stray asterisks
  e1000: fix memory leaks
  i40e: Fix receive buffer starvation for AF_XDP
  igb: Fix constant media auto sense switching when no cable is connected
  net: ethernet: arc: add the missed clk_disable_unprepare
  ANDROID: gki_defconfig: enable CONFIG_KEYBOARD_GPIO
  NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
  NFSv4: Don't allow a cached open with a revoked delegation
  arm64: apply ARM64_ERRATUM_843419 workaround for Brahma-B53 core
  arm64: Brahma-B53 is SSB and spectre v2 safe
  arm64: apply ARM64_ERRATUM_845719 workaround for Brahma-B53 core
  igb: Enable media autosense for the i350.
  igb/igc: Don't warn on fatal read failures when the device is removed
  tcp: increase tcp_max_syn_backlog max value
  net: increase SOMAXCONN to 4096
  netdevsim: Fix use-after-free during device dismantle
  rxrpc: Fix handling of last subpacket of jumbo packet
  usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
  ANDROID: Remove KVM_INTEL allmodconfig workaround
  ANDROID: Fix x86_64 allmodconfig build
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  s390/idle: fix cpu idle time calculation
  s390/unwind: fix mixing regs and sp
  s390/cmm: fix information leak in cmm_timeout_handler()
  arm64: cpufeature: Enable Qualcomm Falkor errata 1009 for Kryo
  KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active
  kvm: call kvm_arch_destroy_vm if vm creation fails
  efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMIN
  x86, efi: Never relocate kernel below lowest acceptable address
  efi: libstub/arm: Account for firmware reserved memory at the base of RAM
  efi/random: Treat EFI_RNG_PROTOCOL output as bootloader randomness
  efi/tpm: Return -EINVAL when determining tpm final events log size fails
  efi: Make CONFIG_EFI_RCI2_TABLE selectable on x86 only
  hv_netvsc: Fix error handling in netvsc_attach()
  hv_netvsc: Fix error handling in netvsc_set_features()
  cxgb4: fix panic when attaching to ULD fail
  net: annotate lockless accesses to sk->sk_napi_id
  FROMLIST: scsi: ufs-qcom: enter and exit hibern8 during clock scaling
  FROMLIST: scsi: ufs: export hibern8 entry and exit
  ANDROID: scsi: ufs: UFS crypto variant operations API
  ANDROID: gki_defconfig: enable inline encryption
  FROMLIST: ext4: add inline encryption support
  FROMLIST: f2fs: add inline encryption support
  FROMLIST: fscrypt: add inline encryption support
  FROMLIST: scsi: ufs: Add inline encryption support to UFS
  FROMLIST: scsi: ufs: UFS crypto API
  FROMLIST: scsi: ufs: UFS driver v2.1 spec crypto additions
  FROMLIST: block: blk-crypto for Inline Encryption
  ANDROID: block: Fix bio_crypt_should_process WARN_ON
  ALSA: timer: Fix mutex deadlock at releasing card
  FROMLIST: block: Add encryption context to struct bio
  io_uring: ensure we clear io_kiocb->result before each issue
  parisc: fix frame pointer in ftrace_regs_caller()
  net: annotate accesses to sk->sk_incoming_cpu
  FROMLIST: block: Keyslot Manager for Inline Encryption
  FROMLIST: f2fs: add support for IV_INO_LBLK_64 encryption policies
  FROMLIST: ext4: add support for IV_INO_LBLK_64 encryption policies
  FROMLIST: fscrypt: add support for IV_INO_LBLK_64 policies
  FROMLIST: docs: ioctl-number: document fscrypt ioctl numbers
  FROMLIST: fscrypt: zeroize fscrypt_info before freeing
  FROMLIST: fscrypt: remove struct fscrypt_ctx
  FROMLIST: fscrypt: invoke crypto API for ESSIV handling
  mlxsw: core: Unpublish devlink parameters during reload
  qed: Optimize execution time for nvm attributes configuration.
  vxlan: fix unexpected failure of vxlan_changelink()
  qed: fix spelling mistake "queuess" -> "queues"
  SUNRPC: Destroy the back channel when we destroy the host transport
  SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
  SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
  drm/amdgpu: enable -msse2 for GCC 7.1+ users
  drm/amdgpu: fix stack alignment ABI mismatch for GCC 7.1+
  drm/amdgpu: fix stack alignment ABI mismatch for Clang
  drm/radeon: Fix EEH during kexec
  drm/amdgpu/gmc10: properly set BANK_SELECT and FRAGMENT_SIZE
  drm/amdgpu/powerplay/vega10: allow undervolting in p7
  dc.c:use kzalloc without test
  drm/amd/display: setting the DIG_MODE to the correct value.
  drm/amd/display: Passive DP->HDMI dongle detection fix
  drm/amd/display: add 50us buffer as WA for pstate switch in active
  drm/amd/display: Allow inverted gamma
  drm/amd/display: do not synchronize "drr" displays
  drm/amdgpu: If amdgpu_ib_schedule fails return back the error.
  drm/sched: Set error to s_fence if HW job submission failed.
  drm/amdgpu/gfx10: update gfx golden settings for navi12
  drm/amdgpu/gfx10: update gfx golden settings for navi14
  drm/amdgpu/gfx10: update gfx golden settings
  drm/amd/display: Change Navi14's DWB flag to 1
  drm/amdgpu/sdma5: do not execute 0-sized IBs (v2)
  drm/amdgpu: Fix SDMA hang when performing VKexample test
  iwlwifi: fw api: support new API for scan config cmd
  mt76: dma: fix buffer unmap with non-linear skbs
  mt76: mt76x2e: disable pcie_aspm by default
  ALSA: hda - Fix mutex deadlock in HDMI codec driver
  usb: cdns3: gadget: Fix g_audio use case when connected to Super-Speed host
  usb: cdns3: gadget: reset EP_CLAIMED flag while unloading
  gfs2: Fix initialisation of args for remount
  iommu/vt-d: Fix panic after kexec -p for kdump
  iommu/amd: Apply the same IVRS IOAPIC workaround to Acer Aspire A315-41
  iommu/ipmmu-vmsa: Remove dev_err() on platform_get_irq() failure
  nl80211: fix validation of mesh path nexthop
  nl80211: Disallow setting of HT for channel 14
  USB: serial: whiteheat: fix line-speed endianness
  USB: serial: whiteheat: fix potential slab corruption
  MAINTAINERS: Change to my personal email address
  drm/i915: Fix PCH reference clock for FDI on HSW/BDW
  net: rtnetlink: fix a typo fbd -> fdb
  net/smc: fix refcounting for non-blocking connect()
  bonding: fix using uninitialized mode_lock
  net: fec_ptp: Use platform_get_irq_xxx_optional() to avoid error message
  net: fec_main: Use platform_get_irq_byname_optional() to avoid error message
  MAINTAINERS: remove Dave Watson as TLS maintainer
  vxlan: check tun_info options_len properly
  erspan: fix the tun_info options_len check for erspan
  net: hisilicon: Fix ping latency when deal with high throughput
  net/mlx4_core: Dynamically set guaranteed amount of counters per VF
  net/mlx5e: Initialize on stack link modes bitmap
  net/mlx5e: Fix ethtool self test: link speed
  net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget
  net/mlx5e: Don't store direct pointer to action's tunnel info
  net/mlx5: Fix NULL pointer dereference in extended destination
  net/mlx5: Fix rtable reference leak
  net/mlx5e: Only skip encap flows update when encap init failed
  net/mlx5e: Replace kfree with kvfree when free vhca stats
  net/mlx5e: Remove incorrect match criteria assignment line
  net/mlx5e: Determine source port properly for vlan push action
  net/mlx5: Fix flow counter list auto bits struct
  net: mscc: ocelot: refuse to overwrite the port's native vlan
  net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is up
  wimax: i2400: Fix memory leak in i2400m_op_rfkill_sw_toggle
  drm/i915/tgl: Fix doc not corresponding to code
  ANDROID: staging: ion: Fix dynamic heap ID assignment
  ANDROID: Fix typo for FROMLIST: section
  drm/panfrost: Don't dereference bogus MMU pointers
  ANDROID: media: increase video max frame number
  drm/panfrost: fix -Wmissing-prototypes warnings
  net: hisilicon: Fix "Trying to free already-free IRQ"
  fjes: Handle workqueue allocation failure
  Revert "sched: Rework pick_next_task() slow-path"
  arm64: cpufeature: Enable Qualcomm Falkor/Kryo errata 1003
  drm/etnaviv: fix dumping of iommuv2
  drm/etnaviv: reinstate MMUv1 command buffer window check
  drm/etnaviv: fix deadlock in GPU coredump
  arm64: Ensure VM_WRITE|VM_SHARED ptes are clean by default
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  powerpc/powernv: Fix CPU idle to be called with IRQs disabled
  sched/topology: Allow sched_asym_cpucapacity to be disabled
  sched/topology: Don't try to build empty sched domains
  USB: gadget: Reject endpoints with 0 maxpacket value
  powerpc/prom_init: Undo relocation before entering secure mode
  ANDROID: dummy_cpufreq: Implement get()
  ANDROID: gki_defconfig: enable CONFIG_CPUSETS
  ANDROID: virtio: virtio_input: Set the amount of multitouch slots in virtio input
  scsi: qla2xxx: stop timer in shutdown path
  hwmon: (ina3221) Fix read timeout issue
  net: usb: lan78xx: Disable interrupts before calling generic_handle_irq()
  Revert "ANDROID: Revert "kheaders: make headers archive reproducible""
  net: dsa: sja1105: improve NET_DSA_SJA1105_TAS dependency
  net: ethernet: ftgmac100: Fix DMA coherency issue with SW checksum
  net: fix sk_page_frag() recursion from memory reclaim
  udp: fix data-race in udp_set_dev_scratch()
  net: dpaa2: Use the correct style for SPDX License Identifier
  net: add READ_ONCE() annotation in __skb_wait_for_more_packets()
  net: use skb_queue_empty_lockless() in busy poll contexts
  net: use skb_queue_empty_lockless() in poll() handlers
  udp: use skb_queue_empty_lockless()
  net: add skb_queue_empty_lockless()
  ANDROID: gki_defconfig: enable CONFIG_CPU_FREQ_GOV_CONSERVATIVE
  RDMA/hns: Prevent memory leaks of eq->buf_list
  RISC-V: Add PCIe I/O BAR memory mapping
  RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure case
  RDMA/mlx5: Use irq xarray locking for mkey_table
  ANDROID: fix VIDEOBUF2_CORE dependency in 'allmodconfig' builds
  UAS: Revert commit 3ae62a4209 ("UAS: fix alignment of scatter/gather segments")
  usb-storage: Revert commit 747668dbc0 ("usb-storage: Set virt_boundary_mask to avoid SG overflows")
  usbip: Fix free of unallocated memory in vhci tx
  usbip: tools: Fix read_usb_vudc_device() error path handling
  usb: xhci: fix __le32/__le64 accessors in debugfs code
  usb: xhci: fix Immediate Data Transfer endianness
  xhci: Fix use-after-free regression in xhci clear hub TT implementation
  USB: ldusb: fix control-message timeout
  USB: ldusb: use unsigned size format specifiers
  USB: ldusb: fix ring-buffer locking
  USB: Skip endpoints with 0 maxpacket length
  io_uring: don't touch ctx in setup after ring fd install
  ANDROID: modpost: fix up merge issues due to namespace removal
  Revert "ALSA: hda: Flush interrupts on disabling"
  perf/headers: Fix spelling s/EACCESS/EACCES/, s/privilidge/privilege/
  perf/x86/uncore: Fix event group support
  perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h)
  perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity
  perf/core: Start rejecting the syscall with attr.__reserved_2 set
  vringh: fix copy direction of vringh_iov_push_kern()
  vsock/virtio: remove unused 'work' field from 'struct virtio_vsock_pkt'
  virtio_ring: fix stalls for packed rings
  riscv: for C functions called only from assembly, mark with __visible
  riscv: fp: add missing __user pointer annotations
  riscv: add missing header file includes
  riscv: mark some code and data as file-static
  riscv: init: merge split string literals in preprocessor directive
  riscv: add prototypes for assembly language functions from head.S
  io_uring: Fix leaked shadow_req
  fix memory leak in large read decrypt offload
  Linux 5.4-rc5
  usb: cdns3: gadget: Don't manage pullups
  usb: dwc3: remove the call trace of USBx_GFLADJ
  usb: gadget: configfs: fix concurrent issue between composite APIs
  usb: dwc3: pci: prevent memory leak in dwc3_pci_probe
  usb: gadget: composite: Fix possible double free memory bug
  usb: gadget: udc: atmel: Fix interrupt storm in FIFO mode.
  usb: renesas_usbhs: fix type of buf
  usb: renesas_usbhs: Fix warnings in usbhsg_recip_handler_std_set_device()
  usb: gadget: udc: renesas_usb3: Fix __le16 warnings
  usb: renesas_usbhs: fix __le16 warnings
  usb: cdns3: include host-export,h for cdns3_host_init
  usb: mtu3: fix missing include of mtu3_dr.h
  usb: fsl: Check memory resource before releasing it
  usb: dwc3: select CONFIG_REGMAP_MMIO
  ANDROID: add README.md
  selftests: fib_tests: add more tests for metric update
  ipv4: fix route update on metric change.
  net: Zeroing the structure ethtool_wolinfo in ethtool_get_wol()
  ALSA: bebob: Fix prototype of helper function to return negative value
  cxgb4: request the TX CIDX updates to status page
  netns: fix GFP flags in rtnl_net_notifyid()
  net: ethernet: Use the correct style for SPDX License Identifier
  net/smc: keep vlan_id for SMC-R in smc_listen_work()
  net/smc: fix closing of fallback SMC sockets
  ANDROID: virt_wifi: Add data ops for scan data simulation
  ANDROID: Allow DRM_IOCTL_MODE_*_DUMB for render clients.
  riscv: cleanup do_trap_break
  net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static
  ANDROID: cpufreq: create dummy cpufreq driver
  net: mvneta: make stub functions static inline
  net: sch_generic: Use pfifo_fast as fallback scheduler for CAN hardware
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  Revert "ANDROID: x86: Remove a useless warning message"
  ANDROID: init: GKI: enable hidden configs for media
  ANDROID: gki_defconfig: add FORTIFY_SOURCE, remove SPMI_MSM_PMIC_ARB
  Revert "Revert "Revert "Revert "x86/mm: Identify the end of the kernel area to be reserved""""
  build.config.*: Link android-mainline kernels with LLD
  ANDROID: ALSA: jack: Update supported jack switch types
  ANDROID: ASoC: compress: fix unsigned integer overflow check
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  FROMLIST: iommu: Export core IOMMU functions to kernel modules
  FROMLIST: PCI: Export PCI ACS and DMA searching functions to modules
  FROMLIST: of: Export of_phandle_iterator_args() to modules
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
  io_uring: Fix race for sqes with userspace
  io_uring: Fix broken links with offloading
  io_uring: Fix corrupted user_data
  xen: issue deprecation warning for 32-bit pv guest
  kvm: Allocate memslots and buses before calling kvm_arch_init_vm
  powerpc/powernv/eeh: Fix oops when probing cxl devices
  irqchip/sifive-plic: Skip contexts except supervisor in plic_init()
  ANDROID: soc: qcom: Add required header to irq.h
  ACPI: processor: Add QoS requests for all CPUs
  cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
  CIFS: Fix use after free of file info structures
  CIFS: Fix retry mid list corruption on reconnects
  scsi: sd: define variable dif as unsigned int instead of bool
  ANDROID: v4l2-compat-ioctl32.c: copy reserved fields
  scsi: target: cxgbit: Fix cxgbit_fw4_ack()
  IB/core: Avoid deadlock during netlink message handling
  ANDROID: of: property: Enable of_devlink by default
  virt_wifi: fix refcnt leak in module exit routine
  net: remove unnecessary variables and callback
  vxlan: add adjacent link to limit depth level
  net: core: add ignore flag to netdev_adjacent structure
  macsec: fix refcnt leak in module exit routine
  team: fix nested locking lockdep warning
  bonding: use dynamic lockdep key instead of subclass
  bonding: fix unexpected IFF_BONDING bit unset
  net: core: add generic lockdep keys
  net: core: limit nested device depth
  keys: Fix memory leak in copy_net_ns
  ANDROID: of: property: Make sure child dependencies don't block probing of parent
  ANDROID: driver core: Allow fwnode_operations.add_links to differentiate errors
  ANDROID: driver core: Allow a device to wait on optional suppliers
  ANDROID: driver core: Add device link support for SYNC_STATE_ONLY flag
  FROMGIT: docs: driver-model: Add documentation for sync_state
  FROMGIT: driver: core: Improve documentation for fwnode_operations.add_links()
  FROMGIT: of: property: Minor code formatting/style clean ups
  i2c: stm32f7: remove warning when compiling with W=1
  i2c: stm32f7: fix a race in slave mode with arbitration loss irq
  i2c: stm32f7: fix first byte to send in slave mode
  i2c: mt65xx: fix NULL ptr dereference
  RDMA/nldev: Skip counter if port doesn't match
  irqchip/gic-v3-its: Use the exact ITSList for VMOVP
  FROMLIST: drivers: pinctrl: msm: setup GPIO chip in hierarchy
  FROMLIST: drivers: irqchip: pdc: Add irqchip set/get state calls
  FROMLIST: genirq: Introduce irq_chip_get/set_parent_state calls
  FROMLIST: drivers: irqchip: pdc: additionally set type in SPI config registers
  FROMLIST: dt-bindings/interrupt-controller: pdc: add SPI config register
  FROMLIST: of: irq: document properties for wakeup interrupt parent
  FROMLIST: drivers: irqchip: add PDC irqdomain for wakeup capable GPIOs
  FROMLIST: drivers: irqchip: pdc: Do not toggle IRQ_ENABLE during mask/unmask
  FROMLIST: drivers: irqchip: qcom-pdc: update max PDC interrupts
  FROMLIST: irqdomain: add bus token DOMAIN_BUS_WAKEUP
  gfs2: Fix memory leak when gfs2meta's fs_context is freed
  ALSA: hda/realtek - Fix 2 front mics of codec 0x623
  ALSA: hda/realtek - Add support for ALC623
  ALSA: usb-audio: Add DSD support for Gustard U16/X26 USB Interface
  netfilter: nft_payload: fix missing check for matching length in offloads
  ipvs: move old_secure_tcp into struct netns_ipvs
  ipvs: don't ignore errors in case refcounting ip_vs module fails
  ANDROID: drop patches/ symbolic link
  mfd: mt6397: Fix probe after changing mt6397-core
  net: phy: smsc: LAN8740: add PHY_RST_AFTER_CLK_EN flag
  MIPS: tlbex: Fix build_restore_pagemask KScratch restore
  io_uring: correct timeout req sequence when inserting a new entry
  io_uring : correct timeout req sequence when waiting timeout
  io_uring: revert "io_uring: optimize submit_and_wait API"
  MIPS: bmips: mark exception vectors as char arrays
  xsk: Fix registration of Rx-only sockets
  net/flow_dissector: switch to siphash
  MAINTAINERS: Update the Spreadtrum SoC maintainer
  ANDROID: Revert "ANDROID: Removed check for asm-goto"
  riscv: cleanup <asm/bug.h>
  riscv: Fix undefined reference to vmemmap_populate_basepages
  riscv: Fix implicit declaration of 'page_to_section'
  riscv: fix fs/proc/kcore.c compilation with sparsemem enabled
  ANDROID: sdcardfs: evict dentries on fscrypt key removal
  ANDROID: fscrypt: add key removal notifier chain
  ANDROID: move up spin_unlock_bh() ahead of remove_proc_entry()
  ANDROID: Kconfig.gki: Add hidden MMC config support
  ANDROID: Kconfig.gki: Add Hidden QCOM configs
  ANDROID: Kconfig.gki: Add SND_PCM_ELD to HIDDEN_DRM configs
  ANDROID: Kconfig.gki: Add extra audio selections
  ANDROID: Kconfig.gki: Add extra GKI_HIDDEN_REGMAP_CONFIGS selections
  ANDROID: Four part re-add of asm-goto usage [4/4]
  ANDROID: Four part re-add of asm-goto usage [3/4]
  ANDROID: Four part re-add of asm-goto usage [2/4]
  ANDROID: Four part re-add of asm-goto usage [1/4]
  ANDROID: Move out patches/ and replace by link to kernel/common-patches project
  of: reserved_mem: add missing of_node_put() for proper ref-counting
  of: unittest: fix memory leak in unittest_data_add
  dt-bindings: riscv: Fix CPU schema errors
  MAINTAINERS: Remove Gregory and Brian for ARCH_BRCMSTB
  drm/v3d: Fix memory leak in v3d_submit_cl_ioctl
  panfrost: Properly undo pm_runtime_enable when deferring a probe
  dmaengine: cppi41: Fix cppi41_dma_prep_slave_sg() when idle
  posix-cpu-timers: Fix two trivial comments
  timers/sched_clock: Include local timekeeping.h for missing declarations
  lib/vdso: Make clock_getres() POSIX compliant again
  fuse: redundant get_fuse_inode() calls in fuse_writepages_fill()
  fuse: Add changelog entries for protocols 7.1 - 7.8
  fuse: truncate pending writes on O_TRUNC
  fuse: flush dirty data/metadata before non-truncate setattr
  netfilter: nf_tables_offload: restore basechain deletion
  netfilter: nf_flow_table: set timeout before insertion into hashes
  rtlwifi: rtl_pci: Fix problem of too small skb->len
  iwlwifi: pcie: 0x2720 is qu and 0x30DC is not
  iwlwifi: pcie: add workaround for power gating in integrated 22000
  iwlwifi: mvm: handle iwl_mvm_tvqm_enable_txq() error return
  iwlwifi: pcie: fix all 9460 entries for qnj
  iwlwifi: pcie: fix PCI ID 0x2720 configs that should be soc
  rtlwifi: Fix potential overflow on P2P code
  iwlwifi: pcie: fix merge damage on making QnJ exclusive
  scripts/nsdeps: use alternative sed delimiter
  virtiofs: Remove set but not used variable 'fc'
  fs/dax: Fix pmd vs pte conflict detection
  opp: Reinitialize the list_kref before adding the static OPPs again
  bpf: Fix use after free in bpf_get_prog_name
  ALSA: hda: Add Tigerlake/Jasperlake PCI ID
  scsi: qla2xxx: Fix partial flash write of MBI
  scsi: qla2xxx: Initialized mailbox to prevent driver load failure
  scsi: lpfc: Honor module parameter lpfc_use_adisc
  ipv6: include <net/addrconf.h> for missing declarations
  net: openvswitch: free vport unless register_netdevice() succeeds
  selftests: Make l2tp.sh executable
  net: sched: taprio: fix -Wmissing-prototypes warnings
  bnxt_en: Avoid disabling pci device in bnxt_remove_one() for already disabled device.
  bnxt_en: Minor formatting changes in FW devlink_health_reporter
  bnxt_en: Adjust the time to wait before polling firmware readiness.
  bnxt_en: Fix devlink NVRAM related byte order related issues.
  bnxt_en: Fix the size of devlink MSIX parameters.
  net: stmmac: Fix the problem of tso_xmit
  dynamic_debug: provide dynamic_hex_dump stub
  bpf: Fix use after free in subprog's jited symbol removal
  RDMA/uverbs: Prevent potential underflow
  KVM: nVMX: Don't leak L1 MMIO regions to L2
  ARC: perf: Accommodate big-endian CPU
  ARC: [plat-hsdk]: Enable on-boardi SPI ADC IC
  ARC: [plat-hsdk]: Enable on-board SPI NOR flash IC
  KVM: SVM: Fix potential wrong physical id in avic_handle_ldr_update
  cpufreq: Cancel policy update work scheduled before freeing
  s390/kaslr: add support for R_390_GLOB_DAT relocation type
  s390/zcrypt: fix memleak at release
  ALSA: usb-audio: Fix copy&paste error in the validator
  perf/aux: Fix AUX output stopping
  kvm: clear kvmclock MSR on reset
  KVM: x86: fix bugon.cocci warnings
  KVM: VMX: Remove specialized handling of unexpected exit-reasons
  selftests: kvm: fix sync_regs_test with newer gccs
  selftests: kvm: vmx_dirty_log_test: skip the test when VMX is not supported
  selftests: kvm: consolidate VMX support checks
  selftests: kvm: vmx_set_nested_state_test: don't check for VMX support twice
  KVM: Don't shrink/grow vCPU halt_poll_ns if host side polling is disabled
  selftests: kvm: synchronize .gitignore to Makefile
  kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID
  cpuidle: haltpoll: Take 'idle=' override into account
  ACPI: NFIT: Fix unlock on error in scrub_show()
  tracing: Fix race in perf_trace_buf initialization
  x86/cpu/vmware: Fix platform detection VMWARE_PORT macro
  x86/cpu/vmware: Use the full form of INL in VMWARE_HYPERCALL, for clang/llvm
  xdp: Handle device unregister for devmap_hash map type
  r8152: add device id for Lenovo ThinkPad USB-C Dock Gen 2
  ipv4: fix IPSKB_FRAG_PMTU handling with fragmentation
  ARM: 8926/1: v7m: remove register save to stack before svc
  Input: st1232 - fix reporting multitouch coordinates
  Revert "pwm: Let pwm_get_state() return the last implemented state"
  mmc: mxs: fix flags passed to dmaengine_prep_slave_sg
  virtiofs: Retry request submission from worker context
  virtiofs: Count pending forgets as in_flight forgets
  virtiofs: Set FR_SENT flag only after request has been sent
  virtiofs: No need to check fpq->connected state
  virtiofs: Do not end request in submission context
  fuse: don't advise readdirplus for negative lookup
  drm/komeda: Fix typos in komeda_splitter_validate
  drm/komeda: Don't flush inactive pipes
  i2c: aspeed: fix master pending state handling
  mmc: cqhci: Commit descriptors before setting the doorbell
  mmc: sdhci-omap: Fix Tuning procedure for temperatures < -20C
  ALSA: hda/realtek - Add support for ALC711
  perf/aux: Fix tracking of auxiliary trace buffer allocation
  fuse: don't dereference req->args on finished request
  opp: core: Revert "add regulators enable and disable"
  cifs: Fix missed free operations
  CIFS: avoid using MID 0xFFFF
  cifs: clarify comment about timestamp granularity for old servers
  cifs: Handle -EINPROGRESS only when noblockcnt is set
  PM: QoS: Drop frequency QoS types from device PM QoS
  cpufreq: Use per-policy frequency QoS
  PM: QoS: Introduce frequency QoS
  Linux 5.4-rc4
  hwmon: (nct7904) Fix the incorrect value of vsen_mask & tcpu_mask & temp_mode in nct7904_data struct.
  perf/x86/intel/pt: Fix base for single entry topa
  KVM: arm64: pmu: Reset sample period on overflow handling
  KVM: arm64: pmu: Set the CHAINED attribute before creating the in-kernel event
  arm64: KVM: Handle PMCR_EL0.LC as RES1 on pure AArch64 systems
  KVM: arm64: pmu: Fix cycle counter truncation
  net: reorder 'struct net' fields to avoid false sharing
  net: dsa: fix switch tree list
  net: ethernet: dwmac-sun8i: show message only when switching to promisc
  net: aquantia: add an error handling in aq_nic_set_multicast_list
  net: netem: correct the parent's backlog when corrupted packet was dropped
  net: netem: fix error path for corrupted GSO frames
  macb: propagate errors when getting optional clocks
  xen/netback: fix error path of xenvif_connect_data()
  net: hns3: fix mis-counting IRQ vector numbers issue
  scripts/gdb: fix debugging modules on s390
  kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
  mm/thp: allow dropping THP from page cache
  mm/vmscan.c: support removing arbitrary sized pages from mapping
  mm/thp: fix node page state in split_huge_page_to_list()
  proc/meminfo: fix output alignment
  mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
  mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
  mm: include <linux/huge_mm.h> for is_vma_temporary_stack
  zram: fix race between backing_dev_show and backing_dev_store
  mm/memcontrol: update lruvec counters in mem_cgroup_move_account
  ocfs2: fix panic due to ocfs2_wq is null
  hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
  mm: memblock: do not enforce current limit for memblock_phys* family
  mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
  mm/gup: fix a misnamed "write" argument, and a related bug
  mm/gup_benchmark: add a missing "w" to getopt string
  ocfs2: fix error handling in ocfs2_setattr()
  mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
  mm/memunmap: don't access uninitialized memmap in memunmap_pages()
  mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()
  mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo
  scripts/gdb: fix lx-dmesg when CONFIG_PRINTK_CALLER is set
  mm/memory-failure.c: don't access uninitialized memmaps in memory_failure()
  fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c
  drivers/base/memory.c: don't access uninitialized memmaps in soft_offline_page_store()
  xdp: Prevent overflow in devmap_hash cost calculation for 32-bit builds
  filldir[64]: remove WARN_ON_ONCE() for bad directory entries
  scsi: ufs-bsg: Wake the device before sending raw upiu commands
  scsi: lpfc: Check queue pointer before use
  mips: vdso: Fix __arch_get_hw_counter()
  MAINTAINERS: Use @kernel.org address for Paul Burton
  scsi: qla2xxx: fixup incorrect usage of host_byte
  selftests/bpf: More compatible nc options in test_tc_edt
  net/mlx5: fix memory leak in mlx5_fw_fatal_reporter_dump
  net/mlx5: prevent memory leak in mlx5_fpga_conn_create_cq
  net/mlx5e: TX, Fix consumer index of error cqe dump
  net/mlx5e: kTLS, Enhance TX resync flow
  net/mlx5e: kTLS, Save a copy of the crypto info
  net/mlx5e: kTLS, Remove unneeded cipher type checks
  net/mlx5e: kTLS, Limit DUMP wqe size
  net/mlx5e: kTLS, Fix missing SQ edge fill
  net/mlx5e: kTLS, Fix page refcnt leak in TX resync error flow
  net/mlx5e: kTLS, Save by-value copy of the record frags
  net/mlx5e: kTLS, Save only the frag page to release at completion
  net/mlx5e: kTLS, Size of a Dump WQE is fixed
  net/mlx5e: kTLS, Release reference on DUMPed fragments in shutdown flow
  net/mlx5e: Tx, Zero-memset WQE info struct upon update
  net/mlx5e: Tx, Fix assumption of single WQEBB of NOP in cleanup flow
  usb: cdns3: Error out if USB_DR_MODE_UNKNOWN in cdns3_core_init_role()
  ARM: dts: bcm2837-rpi-cm3: Avoid leds-gpio probing issue
  USB: ldusb: fix read info leaks
  IB/core: Use rdma_read_gid_l2_fields to compare GID L2 fields
  RDMA/qedr: Fix reported firmware version
  RDMA/siw: free siw_base_qp in kref release routine
  tracing: Fix "gfp_t" format for synthetic events
  RDMA/iwcm: move iw_rem_ref() calls out of spinlock
  iw_cxgb4: fix ECN check on the passive accept
  net: usb: lan78xx: Connect PHY before registering MAC
  vsock/virtio: discard packets if credit is not respected
  vsock/virtio: send a credit update when buffer size is changed
  mlxsw: spectrum_trap: Push Ethernet header before reporting trap
  ASoC: SOF: control: return true when kcontrol values change
  ASoC: stm32: sai: fix sysclk management on shutdown
  ASoC: Intel: sof-rt5682: add a check for devm_clk_get
  ASoC: rsnd: Reinitialize bit clock inversion flag for every format setting
  net: ensure correct skb->tstamp in various fragmenters
  net: bcmgenet: reset 40nm EPHY on energy detect
  net: bcmgenet: soft reset 40nm EPHYs before MAC init
  net: phy: bcm7xxx: define soft_reset for 40nm EPHY
  net: bcmgenet: don't set phydev->link from MAC
  bus: ti-sysc: Fix watchdog quirk handling
  ARM: OMAP2+: Add pdata for OMAP3 ISP IOMMU
  ARM: OMAP2+: Plug in device_enable/idle ops for IOMMUs
  iommu/vt-d: Return the correct dma mask when we are bypassing the IOMMU
  iommu/amd: Check PM_LEVEL_SIZE() condition in locked section
  nvme-pci: Set the prp2 correctly when using more than 4k page
  HID: i2c-hid: add Trekstor Primebook C11B to descriptor override
  symbol namespaces: revert to previous __ksymtab name scheme
  modpost: make updating the symbol namespace explicit
  modpost: delegate updating namespaces to separate function
  HID: logitech-hidpp: do all FF cleanup in hidpp_ff_destroy()
  HID: logitech-hidpp: rework device validation
  HID: logitech-hidpp: split g920_get_config()
  HID: i2c-hid: Remove runtime power management
  x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard
  x86/hyperv: Set pv_info.name to "Hyper-V"
  ACPI: CPPC: Set pcc_data[pcc_ss_id] to NULL in acpi_cppc_processor_exit()
  dmaengine: qcom: bam_dma: Fix resource leak
  scsi: lpfc: remove left-over BUILD_NVME defines
  scsi: core: try to get module before removing device
  scsi: hpsa: add missing hunks in reset-patch
  scsi: target: core: Do not overwrite CDB byte 1
  net: Update address for MediaTek ethernet driver in MAINTAINERS
  ipv4: fix race condition between route lookup and invalidation
  ipv4: Return -ENETUNREACH if we can't create route but saddr is valid
  net: phy: micrel: Update KSZ87xx PHY name
  net: phy: micrel: Discern KSZ8051 and KSZ8795 PHYs
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  drm/amdgpu/vce: fix allocation size in enc ring test
  drm/amdgpu: fix error handling in amdgpu_bo_list_create
  drm/amdgpu: fix potential VM faults
  drm/amdgpu: user pages array memory leak fix
  drm/amdgpu/vcn: fix allocation size in enc ring test
  drm/amdgpu/uvd7: fix allocation size in enc ring test (v2)
  drm/amdgpu/uvd6: fix allocation size in enc ring test (v2)
  IB/hfi1: Use a common pad buffer for 9B and 16B packets
  IB/hfi1: Avoid excessive retry for TID RDMA READ request
  RDMA/mlx5: Clear old rate limit when closing QP
  net: dsa: microchip: Add shared regmap mutex
  net: dsa: microchip: Do not reinit mutexes on KSZ87xx
  net: stmmac: fix argument to stmmac_pcs_ctrl_ane()
  dpaa2-eth: Fix TX FQID values
  dpaa2-eth: add irq for the dpmac connect/disconnect event
  usb: hso: obey DMA rules in tiocmget
  Btrfs: check for the full sync flag while holding the inode lock during fsync
  Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
  coccinelle: api/devm_platform_ioremap_resource: remove useless script
  ALSA: hda - Force runtime PM on Nvidia HDMI codecs
  dm cache: fix bugs when a GFP_NOWAIT allocation fails
  ARM: davinci_all_defconfig: enable GPIO backlight
  ARM: davinci: dm365: Fix McBSP dma_slave_map entry
  binder: Don't modify VMA bounds in ->mmap handler
  btrfs: tracepoints: Fix bad entry members of qgroup events
  btrfs: tracepoints: Fix wrong parameter order for qgroup events
  stop_machine: Avoid potential race behaviour
  EDAC/ghes: Fix Use after free in ghes_edac remove path
  ALSA: hda/realtek - Enable headset mic on Asus MJ401TA
  ALSA: usb-audio: Disable quirks for BOSS Katana amplifiers
  kheaders: substituting --sort in archive creation
  powerpc/32s: fix allow/prevent_user_access() when crossing segment boundaries.
  net: stmmac: disable/enable ptp_ref_clk in suspend/resume flow
  net: phy: Fix "link partner" information disappear issue
  rxrpc: use rcu protection while reading sk->sk_user_data
  drm/i915: Fixup preempt-to-busy vs resubmission of a virtual request
  drm/i915/userptr: Never allow userptr into the mappable GGTT
  drm/i915: Favor last VBT child device with conflicting AUX ch/DDC pin
  drm/i915/execlists: Refactor -EIO markup of hung requests
  Revert "blackhole_netdev: fix syzkaller reported issue"
  arm64: tags: Preserve tags for addresses translated via TTBR1
  arm64: mm: fix inverted PAR_EL1.F check
  arm64: sysreg: fix incorrect definition of SYS_PAR_EL1_F
  arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
  md/raid0: fix warning message for parameter default_layout
  kthread: make __kthread_queue_delayed_work static
  pinctrl: aspeed-g6: Rename SD3 to EMMC and rework pin groups
  pinctrl: aspeed-g6: Fix UART13 group pinmux
  pinctrl: aspeed-g6: Make SIG_DESC_CLEAR() behave intuitively
  pinctrl: aspeed-g6: Fix I3C3/I3C4 pinmux configuration
  pinctrl: aspeed-g6: Fix I2C14 SDA description
  pinctrl: aspeed-g6: Sort pins for sanity
  dt-bindings: pinctrl: aspeed-g6: Rework SD3 function and groups
  perf kmem: Fix memory leak in compact_gfp_flags()
  usercopy: Avoid soft lockups in test_check_nonzero_user()
  pinctrl: berlin: as370: fix a typo s/spififib/spdifib
  ACPI: processor: Avoid NULL pointer dereferences at init time
  USB: serial: ti_usb_3410_5052: clean up serial data access
  USB: serial: ti_usb_3410_5052: fix port-close races
  xtensa: fix change_bit in exclusive access option
  HID: intel-ish-hid: fix wrong error handling in ishtp_cl_alloc_tx_ring()
  RISC-V: fix virtual address overlapped in FIXADDR_START and VMEMMAP_START
  net: usb: sr9800: fix uninitialized local variable
  net: bcmgenet: Fix RGMII_MODE_EN value for GENET v1/2/3
  net: stmmac: make tc_flow_parsers static
  davinci_cpdma: make cpdma_chan_split_pool static
  net: i82596: fix dma_alloc_attr for sni_82596
  sctp: change sctp_prot .no_autobind with true
  sched: etf: Fix ordering of packets with same txtime
  net: avoid potential infinite loop in tc_ctl_action()
  net: dsa: sja1105: Use the correct style for SPDX License Identifier
  tcp: fix a possible lockdep splat in tcp_done()
  arm: dts: mediatek: Update mt7629 dts to reflect the latest dt-binding
  net: ethernet: mediatek: Fix MT7629 missing GMII mode support
  Revert "Input: elantech - enable SMBus on new (2018+) systems"
  net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions
  net: avoid errors when trying to pop MLPS header on non-MPLS packets
  net: cavium: Use the correct style for SPDX License Identifier
  net: dsa: microchip: Use the correct style for SPDX License Identifier
  PCI: PM: Fix pci_power_up()
  xtensa: virt: fix PCI IO ports mapping
  libata/ahci: Fix PCS quirk application
  vfio/type1: Initialize resv_msi_base
  8250-men-mcb: fix error checking when get_num_ports returns -ENODEV
  USB: usblp: fix use-after-free on disconnect
  usb: udc: lpc32xx: fix bad bit shift operation
  usb: cdns3: Fix dequeue implementation.
  USB: legousbtower: fix a signedness bug in tower_probe()
  USB: legousbtower: fix memleak on disconnect
  USB: ldusb: fix memleak on disconnect
  net: ethernet: broadcom: have drivers select DIMLIB as needed
  net: Update address for vrf and l3mdev in MAINTAINERS
  net: bcmgenet: Set phydev->dev_flags only for internal PHYs
  blackhole_netdev: fix syzkaller reported issue
  ARM: dts: bcm2835-rpi-zero-w: Fix bus-width of sdhci
  sparc64: disable fast-GUP due to unexplained oopses
  btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
  drm/panfrost: Handle resetting on timeout better
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  xfs: change the seconds fields in xfs_bulkstat to signed
  tools headers UAPI: Sync sched.h with the kernel
  rbd: cancel lock_dwork if the wait is interrupted
  ceph: just skip unrecognized info in ceph_reply_info_extra
  tools headers kvm: Sync kvm.h headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  perf c2c: Fix memory leak in build_cl_output()
  perf tools: Fix mode setting in copyfile_mode_ns()
  perf annotate: Fix multiple memory and file descriptor leaks
  io_uring: consider the overflow of sequence for timeout req
  perf tools: Fix resource leak of closedir() on the error paths
  perf evlist: Fix fix for freed id arrays
  perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
  scripts: setlocalversion: fix a bashism
  kbuild: update comment about KBUILD_ALLDIRS
  virtio-fs: don't show mount options
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  btrfs: don't needlessly create extent-refs kernel thread
  iommu/amd: Fix incorrect PASID decoding from event log
  iommu/ipmmu-vmsa: Only call platform_get_irq() when interrupt is mandatory
  iommu/rockchip: Don't use platform_get_irq to implicitly count irqs
  dmaengine: sprd: Fix the possible memory leak issue
  dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config
  dmaengine: xilinx_dma: Fix 64-bit simple AXIDMA transfer
  x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu
  x86/hyperv: Make vapic support x2apic mode
  KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
  arm64: hibernate: check pgd table allocation
  arm64: cpufeature: Treat ID_AA64ZFR0_EL1 as RAZ when SVE is not enabled
  net: aquantia: correctly handle macvlan and multicast coexistence
  net: aquantia: do not pass lro session with invalid tcp checksum
  net: aquantia: when cleaning hw cache it should be toggled
  net: aquantia: temperature retrieval fix
  gpio: lynxpoint: set default handler to be handle_bad_irq()
  gpio: merrifield: Move hardware initialization to callback
  gpio: lynxpoint: Move hardware initialization to callback
  gpio: intel-mid: Move hardware initialization to callback
  gpiolib: Initialize the hardware with a callback
  gpio: merrifield: Restore use of irq_base
  xtensa: drop EXPORT_SYMBOL for outs*/ins*
  mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once
  mm/slab.c: fix kernel-doc warning for __ksize()
  xarray.h: fix kernel-doc warning
  bitmap.h: fix kernel-doc warning and typo
  fs/fs-writeback.c: fix kernel-doc warning
  fs/libfs.c: fix kernel-doc warning
  fs/direct-io.c: fix kernel-doc warning
  mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()
  mm, hugetlb: allow hugepage allocations to reclaim as needed
  lib/test_meminit: add a kmem_cache_alloc_bulk() test
  mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations
  lib/generic-radix-tree.c: add kmemleak annotations
  mm/slub: fix a deadlock in show_slab_objects()
  mm, page_owner: rename flag indicating that page is allocated
  mm, page_owner: decouple freeing stack trace from debug_pagealloc
  mm, page_owner: fix off-by-one error in __set_page_owner_handle()
  xtensa: fix type conversion in __get_user_[no]check
  xtensa: clean up assembly arguments in uaccess macros
  block: Fix elv_support_iosched()
  parisc: Remove 32-bit DMA enforcement from sba_iommu
  parisc: Fix vmap memory leak in ioremap()/iounmap()
  parisc: prefer __section from compiler_attributes.h
  parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
  firmware: dmi: Fix unlikely out-of-bounds read in save_mem_devices
  riscv: tlbflush: remove confusing comment on local_flush_tlb_all()
  riscv: dts: HiFive Unleashed: add default chosen/stdout-path
  riscv: remove the switch statement in do_trap_break()
  drm/panfrost: Add missing GPU feature registers
  bpf: lwtunnel: Fix reroute supplying invalid dst
  xtensa: fix {get,put}_user() for 64bit values
  kmemleak: Do not corrupt the object_list during clean-up
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  hrtimer: Annotate lockless access to timer->base
  staging: wlan-ng: fix exit return when sme->key_idx >= NUM_WEPKEYS
  ARM: imx_v6_v7_defconfig: Enable CONFIG_DRM_MSM
  arm64: dts: imx8mn: Use correct clock for usdhc's ipg clk
  arm64: dts: imx8mm: Use correct clock for usdhc's ipg clk
  arm64: dts: imx8mq: Use correct clock for usdhc's ipg clk
  platform/x86: i2c-multi-instantiate: Fail the probe if no IRQ provided
  ARM: dts: imx7s: Correct GPT's ipg clock source
  ARM: dts: vf610-zii-scu4-aib: Specify 'i2c-mux-idle-disconnect'
  drm/ttm: fix handling in ttm_bo_add_mem_to_lru
  drm/ttm: Restore ttm prefaulting
  drm/ttm: fix busy reference in ttm_mem_evict_first
  ARM: dts: imx6q-logicpd: Re-Enable SNVS power key
  ath10k: fix latency issue for QCA988x
  virtio-fs: Change module name to virtiofs.ko
  dmaengine: imx-sdma: fix size check for sdma script_number
  dmaengine: tegra210-adma: fix transfer failure
  arm64: dts: lx2160a: Correct CPU core idle state name
  dmaengine: sprd: Fix the link-list pointer register configuration issue
  batman-adv: Avoid free/alloc race when handling OGM buffer
  batman-adv: Avoid free/alloc race when handling OGM2 buffer
  netdevsim: Fix error handling in nsim_fib_init and nsim_fib_exit
  net/ibmvnic: Fix EOI when running in XIVE mode.
  net: lpc_eth: avoid resetting twice
  tcp: annotate sk->sk_wmem_queued lockless reads
  tcp: annotate sk->sk_sndbuf lockless reads
  tcp: annotate sk->sk_rcvbuf lockless reads
  tcp: annotate tp->urg_seq lockless reads
  tcp: annotate tp->snd_nxt lockless reads
  tcp: annotate tp->write_seq lockless reads
  tcp: annotate tp->copied_seq lockless reads
  tcp: annotate tp->rcv_nxt lockless reads
  tcp: add rcu protection around tp->fastopen_rsk
  vhost/test: stop device before reset
  tools/virtio: xen stub
  mailmap: Add Simon Arlott (replacement for expired email address)
  rxrpc: Fix possible NULL pointer access in ICMP handling
  drm/amdgpu/sdma5: fix mask value of POLL_REGMEM packet for pipe sync
  drm/amdgpu: Bail earlier when amdgpu.cik_/si_support is not set to 1
  Revert "drm/radeon: Fix EEH during kexec"
  Input: synaptics-rmi4 - avoid processing unknown IRQs
  btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
  drm/msm/dsi: Implement reset correctly
  Btrfs: add missing extents release on file extent cluster relocation error
  x86/boot/64: Round memory hole size up to next PMD page
  x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
  arm64: Fix kcore macros after 52-bit virtual addressing fallout
  tools/virtio: more stubs
  net/smc: receive pending data after RCV_SHUTDOWN
  net/smc: receive returns without data
  net/smc: fix SMCD link group creation with VLAN id
  net: update net_dim documentation after rename
  r8169: fix jumbo packet handling on resume from suspend
  arm64: dts: rockchip: Fix override mode for rk3399-kevin panel
  arm64: dts: rockchip: Fix usb-c on Hugsun X99 TV Box
  arm64: dts: rockchip: fix RockPro64 sdmmc settings
  ARM: 8914/1: NOMMU: Fix exc_ret for XIP
  ARM: 8908/1: add __always_inline to functions called from __get_user_check()
  HID: google: add magnemite/masterball USB ids
  MAINTAINERS: Add BCM2711 to BCM2835 ARCH
  dma-buf/resv: fix exclusive fence get
  drm/edid: Add 6 bpc quirk for SDC panel in Lenovo G50
  dm snapshot: rework COW throttling to fix deadlock
  dm snapshot: introduce account_start_copy() and account_end_copy()
  drm/tiny: Kconfig: Remove always-y THERMAL dep. from TINYDRM_REPAPER
  platform/x86: intel_punit_ipc: Avoid error message when retrieving IRQ
  platform/x86: classmate-laptop: remove unused variable
  opp: of: drop incorrect lockdep_assert_held()
  PM: sleep: include <linux/pm_runtime.h> for pm_wq
  cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown
  ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
  net: silence KCSAN warnings about sk->sk_backlog.len reads
  net: annotate sk->sk_rcvlowat lockless reads
  net: silence KCSAN warnings around sk_add_backlog() calls
  tcp: annotate lockless access to tcp_memory_pressure
  net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head
  net: avoid possible false sharing in sk_leave_memory_pressure()
  tun: remove possible false sharing in tun_flow_update()
  netfilter: conntrack: avoid possible false sharing
  netns: fix NLM_F_ECHO mechanism for RTM_NEWNSID
  scsi: ch: Make it possible to open a ch device multiple times again
  scsi: fix kconfig dependency warning related to 53C700_LE_ON_BE
  scsi: sni_53c710: fix compilation error
  scsi: scsi_dh_alua: handle RTPG sense code correctly during state transitions
  scsi: qla2xxx: fix a potential NULL pointer dereference
  net: usb: qmi_wwan: add Telit 0x1050 composition
  act_mirred: Fix mirred_init_module error handling
  net: taprio: Fix returning EINVAL when configuring without flags
  s390/qeth: Fix initialization of vnicc cmd masks during set online
  s390/qeth: Fix error handling during VNICC initialization
  phylink: fix kernel-doc warnings
  sctp: add chunks to sk_backlog when the newsk sk_socket is not set
  bonding: fix potential NULL deref in bond_update_slave_arr
  net: stmmac: fix disabling flexible PPS output
  net: stmmac: fix length of PTP clock's name string
  ARM: mm: alignment: use "u32" for 32-bit instructions
  ARM: mm: fix alignment handler faults under memory pressure
  ARM: dts: Use level interrupt for omap4 & 5 wlcore
  drivers/amba: fix reset control error handling
  ASoC: simple_card_utils.h: Fix potential multiple redefinition error
  ASoC: msm8916-wcd-digital: add missing MIX2 path for RX1/2
  drm/amdgpu/powerplay: fix typo in mvdd table setup
  iwlwifi: pcie: change qu with jf devices to use qu configuration
  iwlwifi: exclude GEO SAR support for 3168
  iwlwifi: pcie: fix memory leaks in iwl_pcie_ctxt_info_gen3_init
  iwlwifi: dbg_ini: fix memory leak in alloc_sgtable
  iwlwifi: pcie: fix rb_allocator workqueue allocation
  iwlwifi: pcie: fix indexing in command dump for new HW
  iwlwifi: mvm: fix race in sync rx queue notification
  iwlwifi: mvm: force single phy init
  iwlwifi: fix ACPI table revision checks
  iwlwifi: don't access trans_cfg via cfg
  memstick: jmb38x_ms: Fix an error handling path in 'jmb38x_ms_probe()'
  mmc: sdhci-iproc: fix spurious interrupts on Multiblock reads with bcm2711
  pinctrl: armada-37xx: swap polarity on LED group
  arm64: dts: armada-3720-turris-mox: convert usb-phy to phy-supply
  ip6erspan: remove the incorrect mtu limit for ip6erspan
  Doc: networking/device_drivers/pensando: fix ionic.rst warnings
  NFC: pn533: fix use-after-free and memleaks
  Input: soc_button_array - partial revert of support for newer surface devices
  net_sched: fix backward compatibility for TCA_ACT_KIND
  net_sched: fix backward compatibility for TCA_KIND
  net/mlx5: DR, Allow insertion of duplicate rules
  selftests/bpf: More compatible nc options in test_lwt_ip_encap
  selftests/bpf: Set rp_filter in test_flow_dissector
  llc: fix sk_buff refcounting in llc_conn_state_process()
  llc: fix another potential sk_buff leak in llc_ui_sendmsg()
  llc: fix sk_buff leak in llc_conn_service()
  llc: fix sk_buff leak in llc_sap_state_process()
  dm clone: Make __hash_find static
  rt2x00: remove input-polldev.h header
  ARM: dts: am3874-iceboard: Fix 'i2c-mux-idle-disconnect' usage
  ARM: dts: omap5: fix gpu_cm clock provider name
  arm64: Allow CAVIUM_TX2_ERRATUM_219 to be selected
  arm64: Avoid Cavium TX2 erratum 219 when switching TTBR
  arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT
  arm64: KVM: Trap VM ops when ARM64_WORKAROUND_CAVIUM_TX2_219_TVM is set
  mac80211: fix scan when operating on DFS channels in ETSI domains
  mac80211: accept deauth frames in IBSS mode
  cfg80211: fix a bunch of RCU issues in multi-bssid code
  nl80211: fix memory leak in nl80211_get_ftm_responder_stats
  ptp: fix typo of "mechanism" in Kconfig help text
  ionic: fix stats memory dereference
  ASoC: core: Fix pcm code debugfs error
  ARM: dts: sun7i: Drop the module clock from the device tree
  dt-bindings: media: sun4i-csi: Drop the module clock
  rxrpc: Fix call crypto state cleanup
  rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record
  rxrpc: Fix trace-after-put looking at the put call record
  rxrpc: Fix trace-after-put looking at the put connection record
  rxrpc: Fix trace-after-put looking at the put peer record
  media: dt-bindings: Fix building error for dt_binding_check
  rxrpc: Fix call ref leak
  ALSA: hdac: clear link output stream mapping
  ALSA: hda/realtek: Reduce the Headphone static noise on XPS 9350/9360
  lib: test_user_copy: style cleanup
  net: stmmac: selftests: Fix L2 Hash Filter test
  net: stmmac: gmac4+: Not all Unicast addresses may be available
  net: stmmac: selftests: Check if filtering is available before running
  net: dsa: b53: Do not clear existing mirrored port mask
  arm64: dts: zii-ultra: fix ARM regulator states
  soc: imx: imx-scu: Getting UID from SCU should have response
  pinctrl: stmfx: fix null pointer on remove
  pinctrl: iproc: allow for error from platform_get_irq()
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
  pinctrl: ns2: Fix off by one bugs in ns2_pinmux_enable()
  pinctrl: bcm-iproc: Use SPDX header
  pinctrl: armada-37xx: fix control of pins 32 and up
  arm64: dts: rockchip: fix RockPro64 sdhci settings
  arm64: dts: rockchip: fix RockPro64 vdd-log regulator settings
  regulator: qcom-rpmh: Fix PMIC5 BoB min voltage
  ARM: dts: logicpd-torpedo-som: Remove twl_keypad
  cfg80211: wext: avoid copying malformed SSIDs
  mac80211: Reject malformed SSID elements
  mac80211_hwsim: fix incorrect dev_alloc_name failure goto
  MAINTAINERS: Add hp_sdc drivers to parisc arch
  scsi: MAINTAINERS: Update qla2xxx driver
  scsi: zfcp: fix reaction on bit error threshold notification
  scsi: core: save/restore command resid for error handling
  dt-bindings: arm: rockchip: fix Theobroma-System board bindings
  arm64: dts: rockchip: fix Rockpro64 RK808 interrupt line
  HID: Fix assumption that devices have inputs
  ARM: omap2plus_defconfig: Fix selected panels after generic panel changes
  samples/bpf: Add a workaround for asm_inline
  xsk: Fix crash in poll when device does not support ndo_xsk_wakeup
  samples/bpf: Fix build for task_fd_query_user.c
  ASoc: rockchip: i2s: Fix RPM imbalance
  mmc: sh_mmcif: Use platform_get_irq_optional() for optional interrupt
  mmc: renesas_sdhi: Do not use platform_get_irq() to count interrupts
  ACPI: HMAT: ACPI_HMAT_MEMORY_PD_VALID is deprecated since ACPI-6.3
  Input: goodix - add support for 9-bytes reports
  Input: da9063 - fix capability and drop KEY_SLEEP
  ASoC: wm_adsp: Don't generate kcontrols without READ flags
  sysfs: Fixes __BIN_ATTR_WO() macro
  rt2x00: initialize last_reset
  selftests/bpf: test_progs: Don't leak server_fd in test_sockopt_inherit
  selftests/bpf: test_progs: Don't leak server_fd in tcp_rtt
  regulator: pfuze100-regulator: Variable "val" in pfuze100_regulator_probe() could be uninitialized
  ASoC: intel: bytcr_rt5651: add null check to support_button_press
  ASoC: intel: sof_rt5682: add remove function to disable jack
  ASoC: rt5682: add NULL handler to set_jack function
  ASoC: intel: sof_rt5682: use separate route map for dmic
  ASoC: SOF: Intel: hda: Disable DMI L1 entry during capture
  ASoC: SOF: Intel: initialise and verify FW crash dump data.
  ASoC: SOF: Intel: hda: fix warnings during FW load
  ASoC: SOF: pcm: harden PCM STOP sequence
  ASoC: SOF: pcm: fix resource leak in hw_free
  ASoC: SOF: topology: fix parse fail issue for byte/bool tuple types
  ASoC: SOF: loader: fix kernel oops on firmware boot failure
  regulator: lochnagar: Add on_off_delay for VDDCORE
  ASoC: wm_adsp: Fix theoretical NULL pointer for alg_region
  pinctrl: cherryview: restore Strago DMI workaround for all versions
  pinctrl: intel: Allocate IRQ chip dynamic
  HID: prodikeys: make array keys static const, makes object smaller
  HID: fix error message in hid_open_report()
  ASoC: max98373: check for device node before parsing
  regulator: ti-abb: Fix timeout in ti_abb_wait_txdone/ti_abb_clear_all_txdone
  iommu/io-pgtable-arm: Support all Mali configurations
  iommu/io-pgtable-arm: Correct Mali attributes
  iommu/arm-smmu: Free context bitmap in the err path of arm_smmu_init_domain_context
  scsi: qla2xxx: Remove WARN_ON_ONCE in qla2x00_status_cont_entry()
  scsi: sd: Ignore a failure to sync cache due to lack of authorization
  arm64: dts: Fix gpio to pinmux mapping
  libbpf: handle symbol versioning properly for libbpf.a
  arm64: dts: allwinner: a64: sopine-baseboard: Add PHY regulator delay
  arm64: dts: allwinner: a64: Drop PMU node
  arm64: dts: allwinner: a64: pine64-plus: Add PHY regulator delay
  tools: bpf: Use !building_out_of_srctree to determine srctree
  ASoC: topology: Fix a signedness bug in soc_tplg_dapm_widget_create()
  scsi: core: fix dh and multipathing for SCSI hosts without request batching
  scsi: core: fix missing .cleanup_rq for SCSI hosts without request batching
  regulator: da9062: fix suspend_enable/disable preparation
  dt-bindings: fixed-regulator: fix compatible enum
  regulator: fixed: Prevent NULL pointer dereference when !CONFIG_OF
  ASoC: soc-component: fix a couple missing error assignments
  ASoC: wm8994: Do not register inapplicable controls for WM1811
  ASoC: samsung: arndale: Add missing OF node dereferencing
  irqchip/sifive-plic: Switch to fasteoi flow
  irqchip/gic-v3: Fix GIC_LINE_NR accessor
  regulator: core: make regulator_register() EPROBE_DEFER aware
  regulator: of: fix suspend-min/max-voltage parsing
  irqchip/atmel-aic5: Add support for sam9x60 irqchip
  irqchip/al-fic: Add support for irq retrigger

Change-Id: I5e7fd941c93a36889378f480cc27d8ea77d11b39
Signed-off-by: Raghavendra Rao Ananta <rananta@codeaurora.org>
2019-11-04 17:30:19 -08:00
Greg Kroah-Hartman
9be46ff3b6 Merge 5.4-rc4 into android-mainline
Linux 5.4-rc4

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0edccd72fad8b6443b24c8c1005b66d6b8f532ce
2019-10-26 19:24:41 +02:00
Prakash Gupta
8d8eacdb48 mm: run the showmem notifier in alloc failure
When the page allocation fails, it's useful to
be able to see the state of unaccounted memory in the system. Call the
showmem notifier to get other clients to dump out their state.

This is an example output with this patch.

  [  457.125478] SLUB: Unable to allocate memory on node -1, gfp=0x2008000(GFP_NOWAIT|__GFP_ZERO)
  [  457.133982]   cache: kmalloc-128, object size: 128, buffer size: 640, default order: 2, min order: 0
  [  457.143179]   node 0: slabs: 5903, objs: 132755, free: 26
  [  457.906076] BootAnimation: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
  [  457.916395] CPU: 2 PID: 4752 Comm: BootAnimation Not tainted 4.9.37+ #43
  [  457.916398] Hardware name: Qualcomm Technologies, Inc. SDM845 v1 MTP (DT)
  [  457.916402] Call trace:
  [  457.916420] [<ffffff82c5c89504>] dump_backtrace+0x0/0x2c4
  [  457.916426] [<ffffff82c5c897e8>] show_stack+0x20/0x28
  [  457.916434] [<ffffff82c5fea888>] dump_stack+0xb8/0xf4
  [  457.916442] [<ffffff82c5dd136c>] warn_alloc+0x154/0x170
  [  457.916447] [<ffffff82c5dd184c>] __alloc_pages_nodemask+0x430/0xcdc
  [  457.916454] [<ffffff82c5e1c154>] new_slab+0x344/0x430
  [  457.916458] [<ffffff82c5e1e404>] ___slab_alloc.constprop.72+0x2f4/0x398
  [  457.916463] [<ffffff82c5e1e4f0>] __slab_alloc.isra.69.constprop.71+0x48/0x80
  [  457.916467] [<ffffff82c5e1ea24>] kmem_cache_alloc_trace+0x210/0x2dc
  [  457.916476] [<ffffff82c68ebf0c>] binder_transaction+0x280/0x2008
  [  457.916480] [<ffffff82c68ee68c>] binder_thread_write+0x9f8/0x136c
  [  457.916484] [<ffffff82c68f08d0>] binder_ioctl_write_read+0x14c/0x3b0
  [  457.916488] [<ffffff82c68f0df0>] binder_ioctl+0x2bc/0x868
  [  457.916494] [<ffffff82c5e451e4>] do_vfs_ioctl+0xd0/0x858
  [  457.916498] [<ffffff82c5e459fc>] SyS_ioctl+0x90/0xa4
  [  457.916503] [<ffffff82c5c83770>] el0_svc_naked+0x24/0x28
  [  457.916505] Mem-Info:
  [  457.916515] active_anon:83629 inactive_anon:212 isolated_anon:0\x0a active_file:5955 inactive_file:5745 isolated_file:0\x0a unevictable:630956 dirty:0 writeback:0 unstable:0\x0a slab_recl
  aimable:16602 slab_unreclaimable:69384\x0a mapped:3609 shmem:308 pagetables:5737 bounce:0\x0a free:4446 free_pcp:482 free_cma:112
  [  457.916524] Node 0 active_anon:334516kB inactive_anon:848kB active_file:23820kB inactive_file:22980kB unevictable:2523824kB isolated(anon):0kB isolated(file):0kB mapped:14436kB dirty:0kB
  writeback:0kB shmem:1232kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
  [  457.916534] DMA free:12516kB min:3352kB low:4924kB high:6496kB active_anon:45336kB inactive_anon:64kB active_file:3588kB inactive_file:3404kB unevictable:1564224kB writepending:0kB presen
  t:1854120kB managed:1748064kB mlocked:1564224kB slab_reclaimable:10664kB slab_unreclaimable:24400kB kernel_stack:2608kB pagetables:6376kB bounce:0kB free_pcp:572kB local_pcp:0kB free_cma:448
  kB
  [  457.916536] lowmem_reserve[]: 0 1901 1901
  [  457.916550] Normal free:5268kB min:4148kB low:6092kB high:8036kB active_anon:289180kB inactive_anon:784kB active_file:20232kB inactive_file:19576kB unevictable:959600kB writepending:0kB p
  resent:2068224kB managed:1984196kB mlocked:959600kB slab_reclaimable:55744kB slab_unreclaimable:253136kB kernel_stack:19232kB pagetables:16572kB bounce:0kB free_pcp:1356kB local_pcp:116kB fr
  ee_cma:0kB
  [  457.916552] lowmem_reserve[]: 0 0 0
  [  457.916560] DMA: 819*4kB (UMEC) 280*8kB (UMEC) 224*16kB (UME) 98*32kB (UME) 6*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12620kB
  [  457.916594] Normal: 1118*4kB (UMEH) 38*8kB (UMH) 1*16kB (H) 2*32kB (H) 1*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5048kB
  [  457.916627] 12071 total pagecache pages
  [  457.916630] 0 pages in swap cache
  [  457.916633] Swap cache stats: add 0, delete 0, find 0/0
  [  457.916634] Free swap  = 0kB
  [  457.916636] Total swap = 0kB
  [  457.916639] 980586 pages RAM
  [  457.916641] 0 pages HighMem/MovableOnly
  [  457.916643] 47521 pages reserved
  [  457.916645] 51200 pages cma reserved
  [  457.916651] cma: cma-0 pages: => 0 used of 2048 total pages
  [  457.916660] cma: cma-1 pages: => 0 used of 23552 total pages
  [  457.916665] cma: cma-2 pages: => 1695 used of 3072 total pages
  [  457.916670] cma: cma-3 pages: => 8277 used of 9216 total pages
  [  457.916674] cma: cma-4 pages: => 186 used of 5120 total pages
  [  457.916679] cma: cma-5 pages: => 3792 used of 8192 total pages
  [  457.916685]        Heap name  Total heap size Total orphaned size
  [  457.916687] ---------------------------------
  [  457.916691]           qsecom 0x           ba000 0x               0
  [  457.916694]           system 0x         44db000 0x          500000
  [  457.916705] -------------------------------------------------
  [  457.916708] uncached pool = 31027200 cached pool = 0 secure pool = 0
  [  457.916710] pool total (uncached + cached + secure) = 31027200
  [  457.916712] -------------------------------------------------
  [  457.916715]             adsp 0x          614000 0x               0
  [  457.916720]             spss 0x               0 0x               0
  [  457.916725]   secure_display 0x               0 0x               0
  [  457.916727]      secure_heap 0x               0 0x               0.

Change-Id: Id01cce4abf331ff9c1c7ab9f0c0f9b1fc4146467
Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
2019-10-24 10:21:16 -07:00
Greg Kroah-Hartman
630839ac24 Merge 5.4-rc3 into android-mainline
Linux 5.4-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia87ba662738dd58ddb917e32c1fbd812861e7a46
2019-10-17 05:28:13 -07:00
David Rientjes
3f36d86694 mm, hugetlb: allow hugepage allocations to reclaim as needed
Commit b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when
compaction may not succeed") has chnaged the allocator to bail out from
the allocator early to prevent from a potentially excessive memory
reclaim.  __GFP_RETRY_MAYFAIL is designed to retry the allocation,
reclaim and compaction loop as long as there is a reasonable chance to
make forward progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.

The most obvious affected subsystem is hugetlbfs which allocates huge
pages based on an admin request (or via admin configured overcommit).  I
have done a simple test which tries to allocate half of the memory for
hugetlb pages while the memory is full of a clean page cache.  This is
not an unusual situation because we try to cache as much of the memory
as possible and sysctl/sysfs interface to allocate huge pages is there
for flexibility to allocate hugetlb pages at any time.

System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:

  root@test1:~# cat hugetlb_test.sh

  set -x
  echo 0 > /proc/sys/vm/nr_hugepages
  echo 3 > /proc/sys/vm/drop_caches
  echo 1 > /proc/sys/vm/compact_memory
  dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
  TS=$(date +%s)
  echo 256 > /proc/sys/vm/nr_hugepages
  cat /proc/sys/vm/nr_hugepages

The results for 2 consecutive runs on clean 5.3

  root@test1:~# sh hugetlb_test.sh
  + echo 0
  + echo 3
  + echo 1
  + dd if=/mnt/data/file-1G of=/dev/null bs=4096
  262144+0 records in
  262144+0 records out
  1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
  + date +%s
  + TS=1569905284
  + echo 256
  + cat /proc/sys/vm/nr_hugepages
  256
  root@test1:~# sh hugetlb_test.sh
  + echo 0
  + echo 3
  + echo 1
  + dd if=/mnt/data/file-1G of=/dev/null bs=4096
  262144+0 records in
  262144+0 records out
  1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
  + date +%s
  + TS=1569905311
  + echo 256
  + cat /proc/sys/vm/nr_hugepages
  256

Now with b39d0ee263 applied

  root@test1:~# sh hugetlb_test.sh
  + echo 0
  + echo 3
  + echo 1
  + dd if=/mnt/data/file-1G of=/dev/null bs=4096
  262144+0 records in
  262144+0 records out
  1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
  + date +%s
  + TS=1569905516
  + echo 256
  + cat /proc/sys/vm/nr_hugepages
  11
  root@test1:~# sh hugetlb_test.sh
  + echo 0
  + echo 3
  + echo 1
  + dd if=/mnt/data/file-1G of=/dev/null bs=4096
  262144+0 records in
  262144+0 records out
  1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
  + date +%s
  + TS=1569905541
  + echo 256
  + cat /proc/sys/vm/nr_hugepages
  12

The success rate went down by factor of 20!

Although hugetlb allocation requests might fail and it is reasonable to
expect them to under extremely fragmented memory or when the memory is
under a heavy pressure but the above situation is not that case.

Fix the regression by reverting back to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
those requests.

Mike said:

: hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
: boot where this may not be as much of an issue.  However, I am aware of at
: least three use cases where allocations are made after the system has been
: up and running for quite some time:
:
: - DB reconfiguration.  If sysctl/sysfs fails to get required number of
:   huge pages, system is rebooted to perform allocation after boot.
:
: - VM provisioning.  If unable get required number of huge pages, fall
:   back to base pages.
:
: - An application that does not preallocate pool, but rather allocates
:   pages at fault time for optimal NUMA locality.
:
: In all cases, I would expect b39d0ee263 to cause regressions and
: noticable behavior changes.
:
: My quick/limited testing in
: https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
: was insufficient.  It was also mentioned that if something like
: b39d0ee263 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
: requests as in this patch.

[mhocko@suse.com: reworded changelog]
Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
Fixes: b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14 15:04:01 -07:00
Qian Cai
234fdce892 mm/page_alloc.c: fix a crash in free_pages_prepare()
On architectures like s390, arch_free_page() could mark the page unused
(set_page_unused()) and any access later would trigger a kernel panic.
Fix it by moving arch_free_page() after all possible accessing calls.

 Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
 Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
            000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
            00000000275cca48 0000000000000100 0000000000000008 000003d080010000
            00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
 Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
            0000000026c2b962: e32014000036 pfd 2,1024(%r1)
           #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
           >0000000026c2b96e: 41101100  la %r1,256(%r1)
            0000000026c2b972: a737fff8  brctg %r3,26c2b962
            0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
            0000000026c2b97c: e31003400004 lg %r1,832
            0000000026c2b982: ebff1430016a asi 5168(%r1),-1
 Call Trace:
 __free_pages_ok+0x16a/0x5d8)
 memblock_free_all+0x206/0x290
 mem_init+0x58/0x120
 start_kernel+0x2b0/0x570
 startup_continue+0x6a/0xc0
 INFO: lockdep is turned off.
 Last Breaking-Event-Address:
 __free_pages_ok+0x372/0x5d8
 Kernel panic - not syncing: Fatal exception: panic_on_oops
 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C

In the past, only kernel_poison_pages() would trigger this but it needs
"page_poison=on" kernel cmdline, and I suspect nobody tested that on
s390.  Recently, kernel_init_free_pages() (commit 6471384af2 ("mm:
security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
was added and could trigger this as well.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
Fixes: 8823b1dbc0 ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
Fixes: 6471384af2 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: <stable@vger.kernel.org>	[5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:19 -07:00
Greg Kroah-Hartman
cb33d78781 Merge 5.4-rc1 into android-mainline
Linux 5.4-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I15eec52df70f829acf81ff614a1c2a5fb443a4e0
2019-10-02 19:10:07 +02:00
Greg Kroah-Hartman
94139142d9 Merge 5.4-rc1-prelrease into android-mainline
To make the 5.4-rc1 merge easier, merge at a prerelease point in time
before the final release happens.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: If613d657fd0abf9910c5bf3435a745f01b89765e
2019-10-02 17:58:47 +02:00
Linus Torvalds
edf445ad7c Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes)
Merge hugepage allocation updates from David Rientjes:
 "We (mostly Linus, Andrea, and myself) have been discussing offlist how
  to implement a sane default allocation strategy for hugepages on NUMA
  platforms.

  With these reverts in place, the page allocator will happily allocate
  a remote hugepage immediately rather than try to make a local hugepage
  available. This incurs a substantial performance degradation when
  memory compaction would have otherwise made a local hugepage
  available.

  This series reverts those reverts and attempts to propose a more sane
  default allocation strategy specifically for hugepages. Andrea
  acknowledges this is likely to fix the swap storms that he originally
  reported that resulted in the patches that removed __GFP_THISNODE from
  hugepage allocations.

  The immediate goal is to return 5.3 to the behavior the kernel has
  implemented over the past several years so that remote hugepages are
  not immediately allocated when local hugepages could have been made
  available because the increased access latency is untenable.

  The next goal is to introduce a sane default allocation strategy for
  hugepages allocations in general regardless of the configuration of
  the system so that we prevent thrashing of local memory when
  compaction is unlikely to succeed and can prefer remote hugepages over
  remote native pages when the local node is low on memory."

Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since
4.18.

Andrea had this note:

 "The regression of 4.18 was that it was taking hours to start a VM
  where 3.10 was only taking a few seconds, I reported all the details
  on lkml when it was finally tracked down in August 2018.

     https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

  __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
  workload degrade like in the "current upstream" above. And it still
  would have been that bad as above until 5.3-rc5"

where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the
nodes.

As a result 5.3 got the two performance fix reverts in rc5.

However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree.  He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):

 - "avoid expensive reclaim when compaction may not succeed": just admit
   that the allocation failed when you're trying to allocate a huge-page
   and compaction wasn't successful.

 - "allow hugepage fallback to remote nodes when madvised": when that
   node-local huge-page allocation failed, retry without forcing the
   local node.

but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.

But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also restores the performance regression that the late reverts caused.

Fingers crossed.

* emailed patches from David Rientjes <rientjes@google.com>:
  mm, page_alloc: allow hugepage fallback to remote nodes when madvised
  mm, page_alloc: avoid expensive reclaim when compaction may not succeed
  Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
  Revert "Revert "mm, thp: restore node-local hugepage allocations""
2019-09-28 14:26:47 -07:00
David Rientjes
b39d0ee263 mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:

 - isolate_freepages() is responsible for finding free pages to use as
   migration targets and is implemented as a linear scan of memory
   starting at the end of a zone,

 - failing order-0 watermark checks in memory compaction does not account
   for how far below the watermarks the zone actually is: to enable
   migration, there must be *some* free memory available.  Per the above,
   watermarks are not always suffficient if isolate_freepages() cannot
   find the free memory but it could require hundreds of MBs of reclaim to
   even reach this threshold (read: potentially very expensive reclaim with
   no indication compaction can be successful), and

 - if compaction at this order has failed recently so that it does not even
   run as a result of deferred compaction, looping through reclaim can often
   be pointless.

For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.

Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory.  If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-28 14:05:38 -07:00
Yang Shi
7ae88534cd mm: move mem_cgroup_uncharge out of __page_cache_release()
A later patch makes THP deferred split shrinker memcg aware, but it needs
page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.

So move mem_cgroup_uncharge() from __page_cache_release() to compound page
destructor, which is called by both THP and other compound pages except
HugeTLB.  And call it in __put_single_page() for single order page.

Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Yang Shi
364c1eebe4 mm: thp: extract split_queue_* into a struct
Patch series "Make deferred split shrinker memcg aware", v6.

Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration.  For example the below test would
run into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from kernel selftest.

It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue.  The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg.  When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.

Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.

Make deferred split shrinker not depend on memcg kmem since it is not
slab.  It doesn't make sense to not shrink THP even though memcg kmem is
disabled.

With the above change the test demonstrated above doesn't trigger OOM even
though with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Vlastimil Babka
4943308556 mm, compaction: raise compaction priority after it withdrawns
Mike Kravetz reports that "hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often then it
should.  Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made."

The problem is that the compaction_withdrawn() test in
should_compact_retry() includes compaction outcomes that are only possible
on low compaction priority, and results in a retry without increasing the
priority.  This may result in furter reclaim, and more incomplete
compaction attempts.

With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.

The COMPACT_SKIPPED result doesn't really fit together with the other
outcomes in compaction_withdrawn(), as that's a result caused by
insufficient order-0 pages, not due to low compaction priority.  With this
patch, it is moved to a new compaction_needs_reclaim() function, and for
that outcome we keep the current logic of retrying if it looks like
reclaim will be able to help.

Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:10 -07:00
Matthew Wilcox (Oracle)
d8c6546b1a mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page).  Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Greg Kroah-Hartman
896be8f44d Merge 5.4-rc1-prereleae into android-mainline
To make the 5.4-rc1 merge easier, merge at a prerelease point in time
before the final release happens.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I29b683c837ed1a3324644dbf9bf863f30740cd0b
2019-09-23 14:14:08 +02:00
Linus Torvalds
84da111de0 Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ...
2019-09-21 10:07:42 -07:00
Greg Kroah-Hartman
bfa0399bc8 Merge Linus's 5.4-rc1-prerelease branch into android-mainline
This merges Linus's tree as of commit b41dae061b ("Merge tag
'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")
into android-mainline.

This "early" merge makes it easier to test and handle merge conflicts
instead of having to wait until the "end" of the merge window and handle
all 10000+ commits at once.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812
2019-09-20 16:07:54 -07:00
Linus Torvalds
7e67a85999 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16 17:25:49 -07:00
Matt Fleming
a55c7454a8 sched/topology: Improve load balancing on AMD EPYC systems
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

  commit 32e45ff43e ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

 node distances:
 node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:37 +02:00
Greg Kroah-Hartman
ad455d87e5 Merge 5.3-rc6 into android-mainline
Linux 5.3-rc6

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id10580d48d56054408b3efe0bd1866d67aba2a3d
2019-08-26 16:45:30 +02:00
David Rientjes
cd96103838 mm, page_alloc: move_freepages should not examine struct page of reserved memory
After commit 907ec5fca3 ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed.  This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.

The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory.  That assumption is
no longer true after the aforementioned commit.

There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.

Note that pfn_valid() can be true for reserved memory: there is a
backing struct page.  The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.

Move the debug checks to after verifying PageBuddy is true.  This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on.  In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).

Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback.  We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820.  pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone.  The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.

The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.

We noticed it almost immediately after bringing 907ec5fca3 in on
CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype.  So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.

Some struct pages were already zeroed from reserve pages before
907ec5fca3c so it theoretically could trigger before this commit.  I
think it's rare enough under a config option that most people don't run
that others may not have noticed.  I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.

Mel said:

: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
Christoph Hellwig
fdc029b19d memremap: remove the dev field in struct dev_pagemap
The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have.  Just remove the field and thus
the name in those two messages.

Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Greg Kroah-Hartman
37766c2946 Merge 5.3.0-rc1 into android-mainline
Linus 5.3-rc1 release

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic171e37d4c21ffa495240c5538852bbb5a9dcce8
2019-07-23 16:21:59 -07:00
Dan Williams
ba72b4c8cf mm/sparsemem: support sub-section hotplug
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.

For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
      2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
     devm_memremap_pages+0x460/0x6e0
     pmem_attach_disk+0x29e/0x680 [nd_pmem]
     ? nd_dax_probe+0xfc/0x120 [libnvdimm]
     nvdimm_bus_probe+0x66/0x160 [libnvdimm]

It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.

A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation.  Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.

Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.

[1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

[osalvador@suse.de: fix deactivate_section for early sections]
  Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Dan Williams
46d945aeab mm: kill is_dev_zone() helper
Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Dan Williams
f46edbd1b1 mm/sparsemem: add helpers track active portions of a section at boot
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.

The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask.  The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.

The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data.  So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.

Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Jane Chu <jane.chu@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Dan Williams
f1eca35a0d mm/sparsemem: introduce struct mem_section_usage
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular.  Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next.  Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages().  Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:

Current design assumptions:

 - Sections that describe boot memory (early sections) are never
   unplugged / removed.

 - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
   valid_section() check

 - __add_pages() and helper routines assume all operations occur in
   PAGES_PER_SECTION units.

 - The memblock sysfs interface only comprehends full sections

New design assumptions:

 - Sections are instrumented with a sub-section bitmask to track (on
   x86) individual 2MB sub-divisions of a 128MB section.

 - Partially populated early sections can be extended with additional
   sub-sections, and those sub-sections can be removed with
   arch_remove_memory(). With this in place we no longer lose usable
   memory capacity to padding.

 - pfn_valid() is updated to look deeper than valid_section() to also
   check the active-sub-section mask. This indication is in the same
   cacheline as the valid_section() so the performance impact is
   expected to be negligible. So far the lkp robot has not reported any
   regressions.

 - Outside of the core vmemmap population routines which are replaced,
   other helper routines like shrink_{zone,pgdat}_span() are updated to
   handle the smaller granularity. Core memory hotplug routines that
   deal with online memory are not touched.

 - The existing memblock sysfs user api guarantees / assumptions are not
   touched since this capability is limited to !online
   !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as global constant instead of per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users.  Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Yafang Shao
e5ca8071fe mm/vmscan.c: add a new member reclaim_state in struct shrink_control
Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths".

This patchset is to fix the issues in doing shrink slab.

There're six different reclaim paths by now,
 - kswapd reclaim path
 - node reclaim path
 - hibernate preallocate memory reclaim path
 - direct reclaim path
 - memcg reclaim path
 - memcg softlimit reclaim path

The slab caches reclaimed in these paths are only calculated in the
above three paths.  The issues are detailed explained in patch #2.  We
should calculate the reclaimed slab caches in every reclaim path.  In
order to do it, the struct reclaim_state is placed into the struct
shrink_control.

In node reclaim path, there'is another issue about shrinking slab, which
is adressed in "mm/vmscan: shrink slab in node reclaim"
(https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/).

This patch (of 2):

The struct reclaim_state is used to record how many slab caches are
reclaimed in one reclaim path.  The struct shrink_control is used to
control one reclaim path.  So we'd better put reclaim_state into
shrink_control.

[laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()]
Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com
Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:21 -07:00
Linus Torvalds
fec88ab0af Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Alexander Potapenko
6471384af2 mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.

Provide init_on_alloc and init_on_free boot options.

These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.

Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.

Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.

As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations.  There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.

This patch (of 2):

The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.

This is expected to be on-by-default on Android and Chrome OS.  And it
gives the opportunity for anyone else to use it under distros too via the
boot args.  (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)

init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes.  Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.

init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion.  This helps to ensure sensitive data
doesn't leak via use-after-free accesses.

Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory.  The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
zero-initialized to preserve their semantics.

Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.

If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.

Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.

The new features are also going to pave the way for hardware memory
tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects.  With MTE, tagging will have the
same cost as memory initialization.

Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized.  There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.

[glider@google.com: v8]
  Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
  Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
  Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:46 -07:00
Nicholas Piggin
e03a5125ec mm/large system hash: clear hashdist when only one node with memory is booted
CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
when booting on single node machines.  This causes the large system hashes
to be allocated with vmalloc, and mapped with small pages.

This change clears hashdist if only one node has come up with memory.

This results in the important large inode and dentry hashes using memblock
allocations.  All others are within 4MB size up to about 128GB of RAM,
which allows them to be allocated from the linear map on most non-NUMA
images.

Other big hashes like futex and TCP should eventually be moved over to the
same style of allocation as those vfs caches that use HASH_EARLY if
!hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.

This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
(performance is in the noise, under 1% difference, page tables are likely
to be well cached for this workload).

Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:46 -07:00
Nicholas Piggin
ec11408a16 mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
The kernel currently clamps large system hashes to MAX_ORDER when hashdist
is not set, which is rather arbitrary.

vmalloc space is limited on 32-bit machines, but this shouldn't result in
much more used because of small physical memory limiting system hash
sizes.

Include "vmalloc" or "linear" in the kernel log message.

Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:46 -07:00
Vlastimil Babka
3972f6bb1c mm, debug_pagealloc: use a page type instead of page_ext flag
When debug_pagealloc is enabled, we currently allocate the page_ext
array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
we have the page_type field in struct page, we can use that instead, as
guard pages are neither PageSlab nor mapped to userspace.  This reduces
memory overhead when debug_pagealloc is enabled and there are no other
features requiring the page_ext array.

Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Vlastimil Babka
4462b32c92 mm, page_alloc: more extensive free page checking with debug_pagealloc
The page allocator checks struct pages for expected state (mapcount,
flags etc) as pages are being allocated (check_new_page()) and freed
(free_pages_check()) to provide some defense against errors in page
allocator users.

Prior commits 479f854a20 ("mm, page_alloc: defer debugging checks of
pages allocated from the PCP") and 4db7548ccb ("mm, page_alloc: defer
debugging checks of freed pages until a PCP drain") this has happened
for order-0 pages as they were allocated from or freed to the per-cpu
caches (pcplists).  Since those are fast paths, the checks are now
performed only when pages are moved between pcplists and global free
lists.  This however lowers the chances of catching errors soon enough.

In order to increase the chances of the checks to catch errors, the
kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
multiple other internal debug checks (VM_BUG_ON() etc), which is
suboptimal when the goal is to catch errors in mm users, not in mm code
itself.

To catch some wrong users of the page allocator we have
CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
unless enabled at boot time.  Memory corruptions when writing to freed
pages have often the same underlying errors (use-after-free, double free)
as corrupting the corresponding struct pages, so this existing debugging
functionality is a good fit to extend by also perform struct page checks
at least as often as if CONFIG_DEBUG_VM was enabled.

Specifically, after this patch, when debug_pagealloc is enabled on boot,
and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
freed to the pcplists *in addition* to being moved between pcplists and
free lists.  When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
pages are checked when being moved between pcplists and free lists *in
addition* to when allocated from or freed to the pcplists.

When debug_pagealloc is not enabled on boot, the overhead in fast paths
should be virtually none thanks to the use of static key.

Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Vlastimil Babka
96a2b03f28 mm, debug_pagelloc: use static keys to enable debugging
Patch series "debug_pagealloc improvements".

I have been recently debugging some pcplist corruptions, where it would be
useful to perform struct page checks immediately as pages are allocated
from and freed to pcplists, which is now only possible by rebuilding the
kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).

To make this kind of debugging simpler in future on a distro kernel, I
have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
and extended it to perform the struct page checks more often when enabled
(Patch 2).  Now it can be configured in when building a distro kernel
without extra overhead, and debugging page use after free or double free
can be enabled simply by rebooting with debug_pagealloc=on.

This patch (of 3):

CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f
("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
allow being always enabled in a distro kernel, but only perform its
expensive functionality when booted with debug_pagelloc=on.  We can
further reduce the overhead when not boot-enabled (including page
allocator fast paths) using static keys.  This patch introduces one for
debug_pagealloc core functionality, and another for the optional guard
page functionality (enabled by booting with debug_guardpage_minorder=X).

Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Denis Efremov
98ef2046f2 mm: remove the exporting of totalram_pages
Previously totalram_pages was the global variable.  Currently,
totalram_pages is the static inline function from the include/linux/mm.h
However, the function is also marked as EXPORT_SYMBOL, which is at best an
odd combination.  Because there is no point for the static inline function
from a public header to be exported, this commit removes the
EXPORT_SYMBOL() marking.  It will be still possible to use the function in
modules because all the symbols it depends on are exported.

Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
Fixes: ca79b0c211 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
Signed-off-by: Denis Efremov <efremov@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Greg Kroah-Hartman
a4bbf3df04 Merge 5.2 into android-common
Linux 5.2

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-07-08 08:24:40 +02:00
Juergen Gross
b9705d8778 mm/page_alloc.c: fix regression with deferred struct page init
Commit 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time
instead of doing larger sections") is causing a regression on some
systems when the kernel is booted as Xen dom0.

The system will just hang in early boot.

Reason is an endless loop in get_page_from_freelist() in case the first
zone looked at has no free memory.  deferred_grow_zone() is always
returning true due to the following code snipplet:

  /* If the zone is empty somebody else may have cleared out the zone */
  if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
                                           first_deferred_pfn)) {
          pgdat->first_deferred_pfn = ULONG_MAX;
          pgdat_resize_unlock(pgdat, &flags);
          return true;
  }

This in turn results in the loop as get_page_from_freelist() is assuming
forward progress can be made by doing some more struct page
initialization.

Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
Fixes: 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
Signed-off-by: Juergen Gross <jgross@suse.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 11:12:07 +09:00
Christoph Hellwig
8a164fef9c mm: simplify ZONE_DEVICE page private data
Remove the clumsy hmm_devmem_page_{get,set}_drvdata helpers, and
instead just access the page directly.  Also make the page data
a void pointer, and thus much easier to use.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:45 -03:00
Christoph Hellwig
514caf23a7 memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible.  Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Greg Kroah-Hartman
b9c482880a Merge 5.2-rc2 into android-mainline
Linux 5.2-rc2

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-05-27 09:45:14 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Greg Kroah-Hartman
1226c72a32 Merge 5.2-rc1 into android-mainline
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-05-20 20:17:24 +02:00
Dan Williams
97500a4a54 mm: maintain randomization of page free lists
When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time.  Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Dan Williams
b03641af68 mm: move buddy list manipulations into helpers
In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions.  Provide a common control point for injecting
additional behavior when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
  Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
  Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Dan Williams
e900a918b0 mm: shuffle initial free memory to improve memory-side-cache utilization
Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e5
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived cased used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated they
   would fit and cause zero conflicts.  While on separate emulated nodes
   without randomization they underutilized the cache and conflicted
   unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages.  Pages are freed in physical address order when first
onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions.  It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages.  At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent.  As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
  Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
  Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Baruch Siach
136ac591f0 mm: update references to page _refcount
Commit 0139aa7b7f ("mm: rename _count, field of the struct page, to
_refcount") left out a couple of references to the old field name.  Fix
that.

Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
Fixes: 0139aa7b7f ("mm: rename _count, field of the struct page, to _refcount")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Mike Rapoport
350e88bad4 mm: memblock: make keeping memblock memory opt-in rather than opt-out
Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.

Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Yafang Shao
1c52e6d068 mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
Because rmqueue_pcplist() is only called when order is 0, we don't need to
use order as a parameter.

Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Michal Hocko
5557c766ab mm, memory_hotplug: cleanup memory offline path
check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated suceeds on the range.  This is based on
the assumption that all pages in the range are freed which is currently
the case in most cases but it won't be with later changes, as pages marked
as vmemmap won't be isolated.

Move the offlined pages counting to offline_isolated_pages_cb and rely on
__offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do it's primary job and check the pfn
range.

While we are at it remove check_pages_isolated and offline_isolated_pages
and use directly walk_system_ram_range as do in online_pages.

Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00