4dc0484e90a3cde1cb1514f1db1de3145d99cb18
419 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
98f401d363 |
mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse
commit 2555283eb40df89945557273121e9393ef9b542b upstream.
anon_vma->degree tracks the combined number of child anon_vmas and VMAs
that use the anon_vma as their ->anon_vma.
anon_vma_clone() then assumes that for any anon_vma attached to
src->anon_vma_chain other than src->anon_vma, it is impossible for it to
be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
elevated by 1 because of a child anon_vma, meaning that if ->degree
equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
This assumption is wrong because the ->degree optimization leads to leaf
nodes being abandoned on anon_vma_clone() - an existing anon_vma is
reused and no new parent-child relationship is created. So it is
possible to reuse an anon_vma for one VMA while it is still tied to
another VMA.
This is an issue because is_mergeable_anon_vma() and its callers assume
that if two VMAs have the same ->anon_vma, the list of anon_vmas
attached to the VMAs is guaranteed to be the same. When this assumption
is violated, vma_merge() can merge pages into a VMA that is not attached
to the corresponding anon_vma, leading to dangling page->mapping
pointers that will be dereferenced during rmap walks.
Fix it by separately tracking the number of child anon_vmas and the
number of VMAs using the anon_vma as their ->anon_vma.
Fixes:
|
||
![]() |
3c3fbfa6dd |
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: |
||
![]() |
38cda6b5ab |
mm/thp: fix page_address_in_vma() on file THP tails
commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 upstream.
Anon THP tails were already supported, but memory-failure may need to
use page_address_in_vma() on file THP tails, which its page->mapping
check did not permit: fix it.
hughd adds: no current usage is known to hit the issue, but this does
fix a subtle trap in a general helper: best fixed in stable sooner than
later.
Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes:
|
||
![]() |
37ffe9f4d7 |
mm/thp: fix vma_address() if virtual address below file offset
commit 494334e43c16d63b878536a26505397fce6ff3a2 upstream. Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). When that BUG() was changed to a WARN(), it would later crash on the VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap(). vma_address() is usually correct, but there's a wraparound case when the vm_start address is unusually low, but vm_pgoff not so low: vma_address() chooses max(start, vma->vm_start), but that decides on the wrong address, because start has become almost ULONG_MAX. Rewrite vma_address() to be more careful about vm_pgoff; move the VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be safely used from page_mapped_in_vma() and page_address_in_vma() too. Add vma_address_end() to apply similar care to end address calculation, in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one(); though it raises a question of whether callers would do better to supply pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch. An irritation is that their apparent generality breaks down on KSM pages, which cannot be located by the page->index that page_to_pgoff() uses: as commit |
||
![]() |
66be14a926 |
mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 upstream.
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e. mapcount 0.
And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.
I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.
Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().
When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem. Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.
Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes:
|
||
![]() |
bfd90b56d7 |
mm/rmap: use page_not_mapped in try_to_unmap()
[ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ] page_mapcount_is_zero() calculates accurately how many mappings a hugepage has in order to check against 0 only. This is a waste of cpu time. We can do this via page_not_mapped() to save some possible atomic_read cycles. Remove the function page_mapcount_is_zero() as it's not used anymore and move page_not_mapped() above try_to_unmap() to avoid identifier undeclared compilation error. Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
ff81af8259 |
mm/rmap: remove unneeded semicolon in page_not_mapped()
[ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ] Remove extra semicolon without any functional change intended. Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
dd156e3fca |
mm/rmap: always do TTU_IGNORE_ACCESS
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ] Since commit |
||
![]() |
336bf30eb7 |
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:
- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).
- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.
Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes:
|
||
![]() |
5eaf35ab12 |
mm/rmap: fix assumptions of THP size
Ask the page what size it is instead of assuming it's PMD size. Do this for anon pages as well as file pages for when someone decides to support that. Leave the assumption alone for pages which are PMD mapped; we don't currently grow THPs beyond PMD size, so we don't need to change this code yet. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
ad7df764b7 |
mm/rmap: fixup copying of soft dirty and uffd ptes
During memory migration a pte is temporarily replaced with a migration swap pte. Some pte bits from the existing mapping such as the soft-dirty and uffd write-protect bits are preserved by copying these to the temporary migration swap pte. However these bits are not stored at the same location for swap and non-swap ptes. Therefore testing these bits requires using the appropriate helper function for the given pte type. Unfortunately several code locations were found where the wrong helper function is being used to test soft_dirty and uffd_wp bits which leads to them getting incorrectly set or cleared during page-migration. Fix these by using the correct tests based on pte type. Fixes: |
||
![]() |
9c1177b62a |
mm/rmap: annotate a data race at tlb_flush_batched
mm->tlb_flush_batched could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6: try_to_unmap_one+0x59a/0x1ab0 set_tlb_ubc_flush_pending at mm/rmap.c:635 (inlined by) try_to_unmap_one at mm/rmap.c:1538 rmap_walk_anon+0x296/0x650 rmap_walk+0xdf/0x100 try_to_unmap+0x18a/0x2f0 shrink_page_list+0xef6/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4: flush_tlb_batched_pending+0x29/0x90 flush_tlb_batched_pending at mm/rmap.c:682 change_p4d_range+0x5dd/0x1030 change_pte_range at mm/mprotect.c:44 (inlined by) change_pmd_range at mm/mprotect.c:212 (inlined by) change_pud_range at mm/mprotect.c:240 (inlined by) change_p4d_range at mm/mprotect.c:260 change_protection+0x222/0x310 change_prot_numa+0x3e/0x60 task_numa_work+0x219/0x350 task_work_run+0xed/0x140 prepare_exit_to_usermode+0x2cc/0x2e0 ret_from_intr+0x32/0x42 Reported by Kernel Concurrency Sanitizer on: CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 flush_tlb_batched_pending() is under PTL but the write is not, but mm->tlb_flush_batched is only a bool type, so the value is unlikely to be shattered. Thus, mark it as an intentional data race by using the data race macro. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
6c357848b4 |
mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
34ae204f18 |
hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit |
||
![]() |
c1e8d7c6a7 |
mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
468c398233 |
mm: memcontrol: switch to native NR_ANON_THPS counter
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the native NR_ANON_THPS accounting sites. [hannes@cmpxchg.org: fixes] Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [build-tested] Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
be5d0a74c6 |
mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
4708f31885 |
mm: prevent a warning when casting void* -> enum
I recently build the RISC-V port with LLVM trunk, which has introduced a new warning when casting from a pointer to an enum of a smaller size. This patch simply casts to a long in the middle to stop the warning. I'd be surprised this is the only one in the kernel, but it's the only one I saw. Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
f45ec5ff16 |
userfaultfd: wp: support swap and page migration
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
396bcc5299 |
mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE
Commit |
||
![]() |
23ab76bf90 |
Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"
This reverts commit |
||
![]() |
c0d0381ade |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/ [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/ [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/ [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
222100eed2 |
mm/vma: make is_vma_temporary_stack() available for general use
Currently the declaration and definition for is_vma_temporary_stack() are scattered. Lets make is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at this, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
47e29d32af |
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
f1fe80d4ae |
mm, thp: do not queue fully unmapped pages for deferred split
Adding fully unmapped pages into deferred split queue is not productive: these pages are about to be freed or they are pinned and cannot be split anyway. Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
30c4638285 |
mm/rmap.c: use VM_BUG_ON_PAGE() in __page_check_anon_rmap()
The __page_check_anon_rmap() just calls two BUG_ON()s protected by CONFIG_DEBUG_VM, the #ifdef could be eliminated by using VM_BUG_ON_PAGE(). Link: http://lkml.kernel.org/r/1573157346-111316-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
091e429954 |
mm/rmap.c: fix outdated comment in page_get_anon_vma()
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because
SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit
|
||
![]() |
4e4a9eb921 |
mm/rmap.c: reuse mergeable anon_vma as parent when fork
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to reuse it. While on fork, the logic is different. Since commit |
||
![]() |
47b390d23b |
mm/rmap.c: don't reuse anon_vma if we just want a copy
Before commit |
||
![]() |
444f84fd2a |
mm: include <linux/huge_mm.h> for is_vma_temporary_stack
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack to fix the following sparse warning: mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Reviewed-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
99cb0dbd47 |
mm,thp: add read-only THP support for (non-shmem) FS
This patch is (hopefully) the first step to enable THP for non-shmem filesystems. This patch enables an application to put part of its text sections to THP via madvise, for example: madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE); We tried to reuse the logic for THP on tmpfs. Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections. The next patch will handle writes, which would only happen when the all the vmas with VM_DENYWRITE are unmapped. An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature. [songliubraving@fb.com: fix build without CONFIG_SHMEM] Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com [songliubraving@fb.com: fix double unlock in collapse_file()] Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
d8c6546b1a |
mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
a50b854e07 |
mm: introduce page_size()
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
1f18b29669 |
mm/rmap.c: remove set but not used variable 'cstart'
Fixes gcc '-Wunused-but-set-variable' warning:
mm/rmap.c: In function page_mkclean_one:
mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]
It is not used any more since
commit
|
||
![]() |
1de13ee592 |
mm/hmm: fix bad subpage pointer in try_to_unmap_one
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased. This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.
However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:
BUG: unable to handle page fault for address: ffffea1fffffffc8
Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".
[rcampbell@nvidia.com: add comment]
Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
Fixes:
|
||
![]() |
059d8442ea |
mm/rmap.c: use the pra.mapcount to do the check
We have the pra.mapcount already, and there is no need to call the page_mapped() which may do some complicated computing for compound page. Link: http://lkml.kernel.org/r/20190404054828.2731-1-sjhuang@iluvatar.ai Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
7269f99993 |
mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier event that represent what is happening to the CPU page table. See the patch which introduced the events to see the rational behind this. Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
6f4f13e8d9 |
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
024eee0e83 |
mm: page_mkclean vs MADV_DONTNEED race
MADV_DONTNEED is handled with mmap_sem taken in read mode. We call page_mkclean without holding mmap_sem. MADV_DONTNEED implies that pages in the region are unmapped and subsequent access to the pages in that range is handled as a new page fault. This implies that if we don't have parallel access to the region when MADV_DONTNEED is run we expect those range to be unallocated. w.r.t page_mkclean() we need to make sure that we don't break the MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock. This implies we skip the pmd if we temporarily mark pmd none. Avoid doing that while marking the page clean. Keep the sequence same for dax too even though we don't support MADV_DONTNEED for dax mapping The bug was noticed by code review and I didn't observe any failures w.r.t test run. This is similar to commit |
||
![]() |
f4b7e272b5 |
mm: remove zone_lru_lock() function, access ->lru_lock directly
We have common pattern to access lru_lock from a page pointer: zone_lru_lock(page_zone(page)) Which is silly, because it unfolds to this: &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock while we can simply do &NODE_DATA(page_to_nid(page))->lru_lock Remove zone_lru_lock() function, since it's only complicate things. Use 'page_pgdat(page)->lru_lock' pattern instead. [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()] Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
ba42273131 |
mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_one
The conversion to use a structure for mmu_notifier_invalidate_range_*()
unintentionally changed the usage in try_to_unmap_one() to init the
'struct mmu_notifier_range' with vma->vm_start instead of @address,
i.e. it invalidates the wrong address range. Revert to the correct
address range.
Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
state in process X" errors when reclaiming from a KVM guest due to KVM
removing the wrong pages from its own mappings.
Reported-by: leozinho29_eu@hotmail.com
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Pankaj gupta <pagupta@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
![]() |
ddeaab32a8 |
hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
This reverts
|
||
![]() |
b43a999005 |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes:
|
||
![]() |
451b9514a5 |
mm: remove __hugepage_set_anon_rmap()
This function is identical to __page_set_anon_rmap() since the time, when it was introduced (8 years ago). The patch removes the function, and makes its users to use __page_set_anon_rmap() instead. Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
ac46d4f3c4 |
mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
906f9cdfc2 |
mm/huge_memory: rename freeze_page() to unmap_page()
The term "freeze" is used in several ways in the kernel, and in mm it
has the particular meaning of forcing page refcount temporarily to 0.
freeze_page() is just too confusing a name for a function that unmaps a
page: rename it unmap_page(), and rename unfreeze_page() remap_page().
Went to change the mention of freeze_page() added later in mm/rmap.c,
but found it to be incorrect: ordinary page reclaim reaches there too;
but the substance of the comment still seems correct, so edit it down.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
Fixes:
|
||
![]() |
017b1660df |
mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the source
page. This is accomplished by using rmap_walk to find all vmas where the
page is mapped. This search stops when page mapcount is zero. For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings. Shared mappings are tracked via the reference count of the PMD
page. Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.
This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors. DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages. A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page. After this, flush caches and TLB.
mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked. Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.
Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes:
|
||
![]() |
bce73e4842 |
mm: do not drop unused pages when userfaultd is running
KVM guests on s390 can notify the host of unused pages. This can result in pte_unused callbacks to be true for KVM guest memory. If a page is unused (checked with pte_unused) we might drop this page instead of paging it. This can have side-effects on userfaultd, when the page in question was already migrated: The next access of that page will trigger a fault and a user fault instead of faulting in a new and empty zero page. As QEMU does not expect a userfault on an already migrated page this migration will fail. The most straightforward solution is to ignore the pte_unused hint if a userfault context is active for this VMA. Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
ccf2b06794 |
Merge tag 'v4.17-rc2' into docs-next
Merge -rc2 to pick up the changes to Documentation/core-api/kernel-api.rst that hit mainline via the networking tree. In their absence, subsequent patches cannot be applied. |
||
![]() |
e71769ae52 |
mm: enable thp migration for shmem thp
My testing for the latest kernel supporting thp migration showed an
infinite loop in offlining the memory block that is filled with shmem
thps. We can get out of the loop with a signal, but kernel should return
with failure in this case.
What happens in the loop is that scan_movable_pages() repeats returning
the same pfn without any progress. That's because page migration always
fails for shmem thps.
In memory offline code, memory blocks containing unmovable pages should be
prevented from being offline targets by has_unmovable_pages() inside
start_isolate_page_range(). So it's possible to change migratability for
non-anonymous thps to avoid the issue, but it introduces more complex and
thp-specific handling in migration code, so it might not good.
So this patch is suggesting to fix the issue by enabling thp migration for
shmem thp. Both of anon/shmem thp are migratable so we don't need
precheck about the type of thps.
Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
Fixes: commit
|