Prabhakar reported an OOPS inside mem_cgroup_get_nr_swap_pages()
function in a corner case seen on some arm64 boards when kdump kernel
runs with "cgroup_disable=memory" passed to the kdump kernel via
bootargs.
The root-cause behind the same is that currently mem_cgroup_swap_init()
function is implemented as a subsys_initcall() call instead of a
core_initcall(), this means 'cgroup_memory_noswap' still remains set to
the default value (false) even when memcg is disabled via
"cgroup_disable=memory" boot parameter.
This may result in premature OOPS inside mem_cgroup_get_nr_swap_pages()
function in corner cases:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000188
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
[0000000000000188] user address but active_mm is swapper
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
<..snip..>
Call trace:
mem_cgroup_get_nr_swap_pages+0x9c/0xf4
shrink_lruvec+0x404/0x4f8
shrink_node+0x1a8/0x688
do_try_to_free_pages+0xe8/0x448
try_to_free_pages+0x110/0x230
__alloc_pages_slowpath.constprop.106+0x2b8/0xb48
__alloc_pages_nodemask+0x2ac/0x2f8
alloc_page_interleave+0x20/0x90
alloc_pages_current+0xdc/0xf8
atomic_pool_expand+0x60/0x210
__dma_atomic_pool_init+0x50/0xa4
dma_atomic_pool_init+0xac/0x158
do_one_initcall+0x50/0x218
kernel_init_freeable+0x22c/0x2d0
kernel_init+0x18/0x110
ret_from_fork+0x10/0x18
Code: aa1403e3 91106000 97f82a27 14000011 (f940c663)
---[ end trace 9795948475817de4 ]---
Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..
Fixes: eccb52e788 ("mm: memcontrol: prepare swap controller setup for integration")
Reported-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/r/1593641660-13254-2-git-send-email-bhsharma@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VMA with VM_GROWSDOWN or VM_GROWSUP flag set can change their size under
mmap_read_lock(). It can lead to race with __do_munmap():
Thread A Thread B
__do_munmap()
detach_vmas_to_be_unmapped()
mmap_write_downgrade()
expand_downwards()
vma->vm_start = address;
// The VMA now overlaps with
// VMAs detached by the Thread A
// page fault populates expanded part
// of the VMA
unmap_region()
// Zaps pagetables partly
// populated by Thread B
Similar race exists for expand_upwards().
The fix is to avoid downgrading mmap_lock in __do_munmap() if detached
VMAs are next to VM_GROWSDOWN or VM_GROWSUP VMA.
[akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]
Fixes: dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.20+]
Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ISA v3.1 does not support the SAO storage control attribute required to
implement PROT_SAO. PROT_SAO was used by specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so removal should not cause problems.
We rather remove it than keep support for older processors, because
live migrating guest partitions to newer processors may not be possible
if SAO is in use (or worse allowed with silent races).
- PROT_SAO stays in the uapi header so code using it would still build.
- arch_validate_prot() is removed, the generic version rejects PROT_SAO
so applications would get a failure at mmap() time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop KVM change for the time being]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
In preparation for removing smp_read_barrier_depends() altogether,
move the Alpha code over to using smp_rmb() and smp_mb() directly.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Naresh Kamboju reported that the LTP tests can cause warnings on i386
going back all the way to v5.0, and bisected it to commit 2c91bd4a4e
("mm: speed up mremap by 20x on large regions").
The warning in move_normal_pmd() is actually mostly correct, but we have
a very unusual special case at process creation time, when we may move
the stack down with an overlapping mode (kind of like a "memmove()"
except using the page tables).
And when you have just the right condition of "move a large initial
stack by the right alignment in the end, but with the early part of the
move being only page-aligned", we'll be in a situation where we're
trying to move a normal PMD entry on top of an already existing - but
now empty - PMD entry.
The warning is still worth having, in case it ever triggers other cases,
and perhaps as a reminder that we could do the stack move case more
efficiently (although it's clearly rare enough that it probably doesn't
matter).
But make it do WARN_ON_ONCE(), so that you can't flood the logs with it.
And add a *big* comment above it to explain and remind us what's going
on, because it took some figuring out to see how this could trigger.
Kudos to Joel Fernandes for debugging this.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Debugged-and-acked-by: Joel Fernandes <joel@joelfernandes.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
debugfs_create_u32_array() allocates a small structure to wrap
the data and size information about the array. If users ever
try to remove the file this leads to a leak since nothing ever
frees this wrapper.
That said there are no upstream users of debugfs_create_u32_array()
that'd remove a u32 array file (we only have one u32 array user in
CMA), so there is no real bug here.
Make callers pass a wrapper they allocated. This way the lifetime
management of the wrapper is on the caller, and we can avoid the
potential leak in debugfs.
CC: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
hmm_range_fault() returns an array of page frame numbers and flags for how
the pages are mapped in the requested process' page tables. The PFN can be
used to get the struct page with hmm_pfn_to_page() and the page size order
can be determined with compound_order(page).
However, if the page is larger than order 0 (PAGE_SIZE), there is no
indication that a compound page is mapped by the CPU using a larger page
size. Without this information, the caller can't safely use a large device
PTE to map the compound page because the CPU might be using smaller PTEs
with different read/write permissions.
Add a new function hmm_pfn_to_map_order() to return the mapping size order
so that callers know the pages are being mapped with consistent
permissions and a large device page table mapping can be used if one is
available.
This will allow devices to optimize mapping the page into HW by avoiding
or batching work for huge pages. For instance the dma_map can be done with
a high order directly.
Link: https://lore.kernel.org/r/20200701225352.9649-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull gfs2 fixes from Andreas Gruenbacher:
"Fix gfs2 readahead deadlocks by adding a IOCB_NOIO flag that allows
gfs2 to use the generic fiel read iterator functions without having to
worry about being called back while holding locks".
* tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Rework read and page fault locking
fs: Add IOCB_NOIO flag for generic_file_read_iter
"physmem" in the memblock allocator is somewhat weird: it's not actually
used for allocation, it's simply information collected during boot, which
describes the unmodified physical memory map at boot time, without any
standby/hotplugged memory. It's only used on s390 and is currently the
only reason s390 keeps using CONFIG_ARCH_KEEP_MEMBLOCK.
Physmem isn't numa aware and current users don't specify any flags. Let's
hide it from the user, exposing only for_each_physmem(), and simplify. The
interface for physmem is now really minimalistic:
- memblock_physmem_add() to add ranges
- for_each_physmem() / __next_physmem_range() to walk physmem ranges
Don't place it into an __init section and don't discard it without
CONFIG_ARCH_KEEP_MEMBLOCK. As we're reusing __next_mem_range(), remove
the __meminit notifier to avoid section mismatch warnings once
CONFIG_ARCH_KEEP_MEMBLOCK is no longer used with
CONFIG_HAVE_MEMBLOCK_PHYS_MAP.
While fixing up the documentation, sneak in some related cleanups. We can
stop setting CONFIG_ARCH_KEEP_MEMBLOCK for s390 next.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Message-Id: <20200701141830.18749-2-david@redhat.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I realize that we fairly recently raised it to 4.8, but the fact is, 4.9
is a much better minimum version to target.
We have a number of workarounds for actual bugs in pre-4.9 gcc versions
(including things like internal compiler errors on ARM), but we also
have some syntactic workarounds for lacking features.
In particular, raising the minimum to 4.9 means that we can now just
assume _Generic() exists, which is likely the much better replacement
for a lot of very convoluted built-time magic with conditionals on
sizeof and/or __builtin_choose_expr() with same_type() etc.
Using _Generic also means that you will need to have a very recent
version of 'sparse', but thats easy to build yourself, and much less of
a hassle than some old gcc version can be.
The latest (in a long string) of reasons for minimum compiler version
upgrades was commit 5435f73d5c ("efi/x86: Fix build with gcc 4").
Ard points out that RHEL 7 uses gcc-4.8, but the people who stay back on
old RHEL versions persumably also don't build their own kernels anyway.
And maybe they should cross-built or just have a little side affair with
a newer compiler?
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
shouldn't trigger any filesystem I/O for the actual request or for
readahead. This allows to do tentative reads out of the page cache as
some filesystems allow, and to take the appropriate locks and retry the
reads only if the requested pages are not cached.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Some Makefiles already pass -fno-stack-protector unconditionally.
For example, arch/arm64/kernel/vdso/Makefile, arch/x86/xen/Makefile.
No problem report so far about hard-coding this option. So, we can
assume all supported compilers know -fno-stack-protector.
GCC 4.8 and Clang support this option (https://godbolt.org/z/_HDGzN)
Get rid of cc-option from -fno-stack-protector.
Remove CONFIG_CC_HAS_STACKPROTECTOR_NONE, which is always 'y'.
Note:
arch/mips/vdso/Makefile adds -fno-stack-protector twice, first
unconditionally, and second conditionally. I removed the second one.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Calling cma_declare_contiguous_nid() with false exact_nid for per-numa
reservation can easily cause cma leak and various confusion. For example,
mm/hugetlb.c is trying to reserve per-numa cma for gigantic pages. But it
can easily leak cma and make users confused when system has memoryless
nodes.
In case the system has 4 numa nodes, and only numa node0 has memory. if
we set hugetlb_cma=4G in bootargs, mm/hugetlb.c will get 4 cma areas for 4
different numa nodes. since exact_nid=false in current code, all 4 numa
nodes will get cma successfully from node0, but hugetlb_cma[1 to 3] will
never be available to hugepage will only allocate memory from
hugetlb_cma[0].
In case the system has 4 numa nodes, both numa node0&2 has memory, other
nodes have no memory. if we set hugetlb_cma=4G in bootargs, mm/hugetlb.c
will get 4 cma areas for 4 different numa nodes. since exact_nid=false in
current code, all 4 numa nodes will get cma successfully from node0 or 2,
but hugetlb_cma[1] and [3] will never be available to hugepage as
mm/hugetlb.c will only allocate memory from hugetlb_cma[0] and
hugetlb_cma[2]. This causes permanent leak of the cma areas which are
supposed to be used by memoryless node.
Of cource we can workaround the issue by letting mm/hugetlb.c scan all cma
areas in alloc_gigantic_page() even node_mask includes node0 only. that
means when node_mask includes node0 only, we can get page from
hugetlb_cma[1] to hugetlb_cma[3]. But this will cause kernel crash in
free_gigantic_page() while it wants to free page by:
cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order)
On the other hand, exact_nid=false won't consider numa distance, it might
be not that useful to leverage cma areas on remote nodes. I feel it is
much simpler to make exact_nid true to make everything clear. After that,
memoryless nodes won't be able to reserve per-numa CMA from other nodes
which have memory.
Fixes: cf11e85fc0 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andreas Schaufler <andreas.schaufler@gmx.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200628074345.27228-1-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The routine hpage_nr_pages() was incorrectly used to calculate the number
of base pages in a hugetlb page. hpage_nr_pages is designed to be called
for THP pages and will return HPAGE_PMD_NR for hugetlb pages of any size.
Due to the context in which hpage_nr_pages was called, it is unlikely to
produce a user visible error. The routine with the incorrect call is only
exercised in the case of hugetlb memory error or migration. In addition,
this would need to be on an architecture which supports huge page sizes
less than PMD_SIZE. And, the vma containing the huge page would also need
to smaller than PMD_SIZE.
Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200629185003.97202-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals. Move it to the swap code
instead of having it in bio.c.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When working with very large nodes, poisoning the struct pages (for which
there will be very many) can take a very long time. If the system is
using voluntary preemptions, the software watchdog will not be able to
detect forward progress. This patch addresses this issue by offering to
give up time like __remove_pages() does. This behavior was introduced in
v5.6 with: commit d33695b16a ("mm/memory_hotplug: poison memmap in
remove_pfn_range_from_zone()")
Alternately, init_page_poison could do this cond_resched(), but it seems
to me that the caller of init_page_poison() is what actually knows whether
or not it should relax its own priority.
Based on Dan's notes, I think this is perfectly safe: commit f931ab479d
("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
Aside from fixing the lockup, it is also a friendlier thing to do on lower
core systems that might wipe out large chunks of hotplug memory (probably
not a very common case).
Fixes this kind of splat:
watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
irq event stamp: 138450
hardirqs last enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
RIP: 0010:memset_erms+0x9/0x10
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
Call Trace:
remove_pfn_range_from_zone+0x3a/0x380
memunmap_pages+0x17f/0x280
release_nodes+0x22a/0x260
__device_release_driver+0x172/0x220
device_driver_detach+0x3e/0xa0
unbind_store+0x113/0x130
kernfs_fop_write+0xdc/0x1c0
vfs_write+0xde/0x1d0
ksys_write+0x58/0xd0
do_syscall_64+0x5a/0x120
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Built 2 zonelists, mobility grouping on. Total pages: 49050381
Policy zone: Normal
Built 3 zonelists, mobility grouping on. Total pages: 49312525
Policy zone: Normal
David said: "It really only is an issue for devmem. Ordinary
hotplugged system memory is not affected (onlined/offlined in memory
block granularity)."
Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
Fixes: commit d33695b16a ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
Reported-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun reports seeing rare div0 crashes in memory.low stress testing:
RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
FS: 0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
Call Trace:
shrink_node+0x1e5/0x6c0
balance_pgdat+0x32d/0x5f0
kswapd+0x1d7/0x3d0
kthread+0x11c/0x160
ret_from_fork+0x1f/0x30
This happens when parent_usage == siblings_protected.
We check that usage is bigger than protected, which should imply
parent_usage being bigger than siblings_protected. However, we don't
read (or even update) these values atomically, and they can be out of
sync as the memory state changes under us. A bit of fluctuation around
the target protection isn't a big deal, but we need to handle the div0
case.
Check the parent state explicitly to make sure we have a reasonable
positive value for the divisor.
Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f8013 ("mm: memcontrol: recursive memory.low protection")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 9e343b467c ("READ_ONCE: Enforce atomicity for
{READ,WRITE}_ONCE() memory accesses"), READ_ONCE() cannot be used
anymore to read complex page table entries.
This leads to:
CC mm/debug_vm_pgtable.o
In file included from ./include/asm-generic/bug.h:5,
from ./arch/powerpc/include/asm/bug.h:109,
from ./include/linux/bug.h:5,
from ./include/linux/mmdebug.h:5,
from ./include/linux/gfp.h:5,
from mm/debug_vm_pgtable.c:13:
In function 'pte_clear_tests',
inlined from 'debug_vm_pgtable' at mm/debug_vm_pgtable.c:363:2:
./include/linux/compiler.h:392:38: error: Unsupported access size for {READ,WRITE}_ONCE().
mm/debug_vm_pgtable.c:249:14: note: in expansion of macro 'READ_ONCE'
249 | pte_t pte = READ_ONCE(*ptep);
| ^~~~~~~~~
make[2]: *** [mm/debug_vm_pgtable.o] Error 1
Fix it by using the recently added ptep_get() helper.
Link: http://lkml.kernel.org/r/6ca8c972e6c920dc4ae0d4affbed9703afa4d010.1592490570.git.christophe.leroy@csgroup.eu
Fixes: 9e343b467c ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Murphy reports that a slightly overcommitted load, testing swap
and zram along with i915, splats and keeps on splatting, when it had
better fail less noisily:
gnome-shell: page allocation failure: order:0,
mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
Call Trace:
dump_stack+0x64/0x88
warn_alloc.cold+0x75/0xd9
__alloc_pages_slowpath.constprop.0+0xcfa/0xd30
__alloc_pages_nodemask+0x2df/0x320
alloc_slab_page+0x195/0x310
allocate_slab+0x3c5/0x440
___slab_alloc+0x40c/0x5f0
__slab_alloc+0x1c/0x30
kmem_cache_alloc+0x20e/0x220
xas_nomem+0x28/0x70
add_to_swap_cache+0x321/0x400
__read_swap_cache_async+0x105/0x240
swap_cluster_readahead+0x22c/0x2e0
shmem_swapin+0x8e/0xc0
shmem_swapin_page+0x196/0x740
shmem_getpage_gfp+0x3a2/0xa60
shmem_read_mapping_page_gfp+0x32/0x60
shmem_get_pages+0x155/0x5e0 [i915]
__i915_gem_object_get_pages+0x68/0xa0 [i915]
i915_vma_pin+0x3fe/0x6c0 [i915]
eb_add_vma+0x10b/0x2c0 [i915]
i915_gem_do_execbuffer+0x704/0x3430 [i915]
i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
drm_ioctl_kernel+0x86/0xd0 [drm]
drm_ioctl+0x206/0x390 [drm]
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported on 5.7, but it goes back really to 3.1: when
shmem_read_mapping_page_gfp() was implemented for use by i915, and
allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
__read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
Fixes: 68da9f0557 ("tmpfs: pass gfp to shmem_getpage_gfp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Chris Murphy <lists@colorremedies.com>
Analyzed-by: Vlastimil Babka <vbabka@suse.cz>
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Chris Murphy <lists@colorremedies.com>
Cc: <stable@vger.kernel.org> [3.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was found that running the LTP test on a PowerPC system could produce
erroneous values in /proc/meminfo, like:
MemTotal: 531915072 kB
MemFree: 507962176 kB
MemAvailable: 1100020596352 kB
Using bisection, the problem is tracked down to commit 9c315e4d7d ("mm:
memcg/slab: cache page number in memcg_(un)charge_slab()").
In memcg_uncharge_slab() with a "int order" argument:
unsigned int nr_pages = 1 << order;
:
mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
The mod_lruvec_state() function will eventually call the
__mod_zone_page_state() which accepts a long argument. Depending on the
compiler and how inlining is done, "-nr_pages" may be treated as a
negative number or a very large positive number. Apparently, it was
treated as a large positive number in that PowerPC system leading to
incorrect stat counts. This problem hasn't been seen in x86-64 yet,
perhaps the gcc compiler there has some slight difference in behavior.
It is fixed by making nr_pages a signed value. For consistency, a similar
change is applied to memcg_charge_slab() as well.
Link: http://lkml.kernel.org/r/20200620184719.10994-1-longman@redhat.com
Fixes: 9c315e4d7d ("mm: memcg/slab: cache page number in memcg_(un)charge_slab()").
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh reports:
"While stressing compaction, one run oopsed on NULL capc->cc in
__free_one_page()'s task_capc(zone): compact_zone_order() had been
interrupted, and a page was being freed in the return from interrupt.
Though you would not expect it from the source, both gccs I was using
(4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
callq compact_zone - long after the "current->capture_control =
&capc". An interrupt in between those finds capc->cc NULL (zeroed by
an earlier rep stos).
This could presumably be fixed by a barrier() before setting
current->capture_control in compact_zone_order(); but would also need
more care on return from compact_zone(), in order not to risk leaking
a page captured by interrupt just before capture_control is reset.
Maybe that is the preferable fix, but I felt safer for task_capc() to
exclude the rather surprising possibility of capture at interrupt
time"
I have checked that gcc10 also behaves the same.
The advantage of fix in compact_zone_order() is that we don't add
another test in the page freeing hot path, and that it might prevent
future problems if we stop exposing pointers to uninitialized structures
in current task.
So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.
Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the async page locking infrastructure, if IOCB_WAITQ is set in the
passed in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO is started but not done yet. This is similar to how
O_DIRECT signals the same operation. Once the callback is received by
the caller for IO completion, the caller must retry the operation.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.
We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, just in preparation for allowing
more callers.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull powerpc fixes from Michael Ellerman:
- One fix for the interrupt rework we did last release which broke
KVM-PR
- Three commits fixing some fallout from the READ_ONCE() changes
interacting badly with our 8xx 16K pages support, which uses a pte_t
that is a structure of 4 actual PTEs
- A cleanup of the 8xx pte_update() to use the newly added pmd_off()
- A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
enabled
- A minor fix for the SPU syscall generation
Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.
* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/8xx: Provide ptep_get() with 16k pages
mm: Allow arches to provide ptep_get()
mm/gup: Use huge_ptep_get() in gup_hugepte()
powerpc/syscalls: Use the number when building SPU syscall table
powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
powerpc/64s: Fix KVM interrupt using wrong save area
powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL