We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.
The logic aims to prevent userspace from doing bad accesses below the
stack pointer. However as long as the stack is < 1MB in size, we allow
all accesses without further checks. Adding some debug I see that I
can do a full kernel build and LTP run, and not a single process has
used more than 1MB of stack. So for the majority of processes the
logic never even fires.
We also recently found a nasty bug in this code which could cause
userspace programs to be killed during signal delivery. It went
unnoticed presumably because most processes use < 1MB of stack.
The generic mm code has also grown support for stack guard pages since
this code was originally written, so the most heinous case of the
stack expanding into other mappings is now handled for us.
Finally although some other arches have special logic in this path,
from what I can tell none of x86, arm64, arm and s390 impose any extra
checks other than those in expand_stack().
So drop our complicated logic and like other architectures just let
the stack expand as long as its within the rlimit.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Daniel Axtens <dja@axtens.net>
Link: https://lore.kernel.org/r/20200724092528.1578671-4-mpe@ellerman.id.au
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.
The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.
The 2048 byte allowance below the stack pointer is there to cover the
288 byte "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.
Unfortunately since then the signal frame has expanded, and is now
4224 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4224 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.
The total size of the signal frame is the size of struct rt_sigframe
(which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
64-bit).
The 2048 byte allowance was correct until 2008 as the signal frame
was:
struct rt_sigframe {
struct ucontext uc; /* 0 1440 */
/* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
long unsigned int _unused[2]; /* 1440 16 */
unsigned int tramp[6]; /* 1456 24 */
struct siginfo * pinfo; /* 1480 8 */
void * puc; /* 1488 8 */
struct siginfo info; /* 1496 128 */
/* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
char abigap[288]; /* 1624 288 */
/* size: 1920, cachelines: 15, members: 7 */
/* padding: 8 */
};
1920 + 128 = 2048
Then in commit ce48b21007 ("powerpc: Add VSX context save/restore,
ptrace and signal support") (Jul 2008) the signal frame expanded to
2304 bytes:
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */ <--
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
long unsigned int _unused[2]; /* 1696 16 */
unsigned int tramp[6]; /* 1712 24 */
struct siginfo * pinfo; /* 1736 8 */
void * puc; /* 1744 8 */
struct siginfo info; /* 1752 128 */
/* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
char abigap[288]; /* 1880 288 */
/* size: 2176, cachelines: 17, members: 7 */
/* padding: 8 */
};
2176 + 128 = 2304
At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.
Then in 2010 commit 320b2b8de1 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.
That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext uc_transact; /* 1696 1696 */ <--
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int _unused[2]; /* 3392 16 */
unsigned int tramp[6]; /* 3408 24 */
struct siginfo * pinfo; /* 3432 8 */
void * puc; /* 3440 8 */
struct siginfo info; /* 3448 128 */
/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
char abigap[288]; /* 3576 288 */
/* size: 3872, cachelines: 31, members: 8 */
/* padding: 8 */
/* last cacheline: 32 bytes */
};
3872 + 128 = 4000
And commit 573ebfa660 ("powerpc: Increase stack redzone for 64-bit
userspace to 512 bytes") (Feb 2014):
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext uc_transact; /* 1696 1696 */
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int _unused[2]; /* 3392 16 */
unsigned int tramp[6]; /* 3408 24 */
struct siginfo * pinfo; /* 3432 8 */
void * puc; /* 3440 8 */
struct siginfo info; /* 3448 128 */
/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
char abigap[512]; /* 3576 512 */ <--
/* size: 4096, cachelines: 32, members: 8 */
/* padding: 8 */
};
4096 + 128 = 4224
Then finally in 2017, commit 1be7107fbe ("mm: larger stack guard
gap, between vmas") exposed us to the existing bug, because it changed
the stack VMA to be the correct/real size, meaning our stack expansion
code is now triggered.
Fix it by increasing the allowance to 4224 bytes.
Hard-coding 4224 is obviously unsafe against future expansions of the
signal frame in the same way as the existing code. We can't easily use
sizeof() because the signal frame structure is not in a header. We
will either fix that, or rip out all the custom stack expansion
checking logic entirely.
Fixes: ce48b21007 ("powerpc: Add VSX context save/restore, ptrace and signal support")
Cc: stable@vger.kernel.org # v2.6.27+
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
Commit 2f92447f9f ("powerpc/book3s64/hash: Use the pte_t address from the
caller") removed the local_irq_disable from hash_preload, but it was
required for more than just the page table walk: the hash pte busy bit is
effectively a lock which may be taken in interrupt context, and the local
update flag test must not be preempted before it's used.
This solves apparent lockups with perf interrupting __hash_page_64K. If
get_perf_callchain then also takes a hash fault on the same page while it
is already locked, it will loop forever taking hash faults, which looks like
this:
cpu 0x49e: Vector: 100 (System Reset) at [c00000001a4f7d70]
pc: c000000000072dc8: hash_page_mm+0x8/0x800
lr: c00000000000c5a4: do_hash_page+0x24/0x38
sp: c0002ac1cc69ac70
msr: 8000000000081033
current = 0xc0002ac1cc602e00
paca = 0xc00000001de1f280 irqmask: 0x03 irq_happened: 0x01
pid = 20118, comm = pread2_processe
Linux version 5.8.0-rc6-00345-g1fad14f18bc6
49e:mon> t
[c0002ac1cc69ac70] c00000000000c5a4 do_hash_page+0x24/0x38 (unreliable)
--- Exception: 300 (Data Access) at c00000000008fa60 __copy_tofrom_user_power7+0x20c/0x7ac
[link register ] c000000000335d10 copy_from_user_nofault+0xf0/0x150
[c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
[c0002ac1cc69afa0] c000000000109df0 read_user_stack_64+0x70/0xf0
[c0002ac1cc69afd0] c000000000109fcc perf_callchain_user_64+0x15c/0x410
[c0002ac1cc69b060] c000000000109c00 perf_callchain_user+0x20/0x40
[c0002ac1cc69b080] c00000000031c6cc get_perf_callchain+0x25c/0x360
[c0002ac1cc69b120] c000000000316b50 perf_callchain+0x70/0xa0
[c0002ac1cc69b140] c000000000316ddc perf_prepare_sample+0x25c/0x790
[c0002ac1cc69b1a0] c000000000317350 perf_event_output_forward+0x40/0xb0
[c0002ac1cc69b220] c000000000306138 __perf_event_overflow+0x88/0x1a0
[c0002ac1cc69b270] c00000000010cf70 record_and_restart+0x230/0x750
[c0002ac1cc69b620] c00000000010d69c perf_event_interrupt+0x20c/0x510
[c0002ac1cc69b730] c000000000027d9c performance_monitor_exception+0x4c/0x60
[c0002ac1cc69b750] c00000000000b2f8 performance_monitor_common_virt+0x1b8/0x1c0
--- Exception: f00 (Performance Monitor) at c0000000000cb5b0 pSeries_lpar_hpte_insert+0x0/0x160
[link register ] c0000000000846f0 __hash_page_64K+0x210/0x540
[c0002ac1cc69ba50] 0000000000000000 (unreliable)
[c0002ac1cc69bb00] c000000000073ae0 update_mmu_cache+0x390/0x3a0
[c0002ac1cc69bb70] c00000000037f024 wp_page_copy+0x364/0xce0
[c0002ac1cc69bc20] c00000000038272c do_wp_page+0xdc/0xa60
[c0002ac1cc69bc70] c0000000003857bc handle_mm_fault+0xb9c/0x1b60
[c0002ac1cc69bd50] c00000000006c434 __do_page_fault+0x314/0xc90
[c0002ac1cc69be20] c00000000000c5c8 handle_page_fault+0x10/0x2c
--- Exception: 300 (Data Access) at 00007fff8c861fe8
SP (7ffff6b19660) is in userspace
Fixes: 2f92447f9f ("powerpc/book3s64/hash: Use the pte_t address from the caller")
Reported-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727060947.10060-1-npiggin@gmail.com
MAX_NUMNODES is a theoretical maximum number of nodes thats is
supported by the kernel. Device tree properties exposes the number of
possible nodes on the current platform. The kernel would detected this
and would use it for most of its resource allocations. If the platform
now increases the nodes to over what was already exposed, then it may
lead to inconsistencies. Hence limit it to the already exposed nodes.
Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724105809.24733-1-srikar@linux.vnet.ibm.com
POWER6 only supports AMR update via privileged mode (MSR[PR] = 0,
SPRN_AMR=29) The PR=1 (userspace) alias for that SPR (SPRN_AMR=13) was
only supported from POWER7. Since we don't allow userspace modifying
of AMR value we should disable pkey support on P6 and before.
The hypervisor will still report pkey support via
"ibm,processor-storage-keys". Hence also check for P7 CPU_FTR bit to
decide on pkey support.
Fixes: f491fe3fb4 ("powerpc/book3s64/pkeys: Simplify the key initialization")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726132517.399076-1-aneesh.kumar@linux.ibm.com
ISA v3.1 does not support the SAO storage control attribute required to
implement PROT_SAO. PROT_SAO was used by specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so removal should not cause problems.
We rather remove it than keep support for older processors, because
live migrating guest partitions to newer processors may not be possible
if SAO is in use (or worse allowed with silent races).
- PROT_SAO stays in the uapi header so code using it would still build.
- arch_validate_prot() is removed, the generic version rejects PROT_SAO
so applications would get a failure at mmap() time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop KVM change for the time being]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
UAMOR values are not application-specific. The kernel initializes
its value based on different reserved keys. Remove the thread-specific
UAMOR value and don't switch the UAMOR on context switch.
Move UAMOR initialization to key initialization code and remove
thread_struct.uamor because it is not used anymore.
Before commit: 4a4a5e5d2a ("powerpc/pkeys: key allocation/deallocation must not change pkey registers")
we used to update uamor based on key allocation and free.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-20-aneesh.kumar@linux.ibm.com
As we kexec across kernels that use AMR/IAMR for different purposes
we need to ensure that new kernels get kexec'd with a reset value
of AMR/IAMR. For ex: the new kernel can use key 0 for kernel mapping and the old
AMR value prevents access to key 0.
This patch also removes reset if IAMR and AMOR in kexec_sequence. Reset of AMOR
is not needed and the IAMR reset is partial (it doesn't do the reset
on secondary cpus) and is redundant with this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-19-aneesh.kumar@linux.ibm.com
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.
Pseries systems supports max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range. That was added that as part of
commit b6eca183e2 ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")
But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.
With this change KVM guest will also be doing linear mapping with
2M page size.
The actual TLB benefit of mapping guest page table entries with
hugepage size can only be materialized if the partition scoped
entries are also using the same or higher page size. A guest using
1G hugetlbfs backing guest memory can have a performance impact with
the above change.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Fold in fix from Aneesh spotted by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-5-aneesh.kumar@linux.ibm.com
We split the page table mapping on memory unplug if the
linear range was mapped with huge page mapping (for ex: 1G)
The page table splitting code has a few issues:
1. Recursive locking
--------------------
Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.
2. BUG: sleeping function called from in_atomic() context
---------------------------------------------------------
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()
3. Bad unlock unbalance
-----------------------
Memory unplug path takes init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reaquire
ptl. Releasing ptl from a different thread than which acquired it
causes bad unlock unbalance.
These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. The kernel always make sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.
In preparation for such a change remove page table splitting support.
This essentially is a revert of
commit 4dd5f8a99e ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-4-aneesh.kumar@linux.ibm.com
We can hit the following BUG_ON during memory unplug:
kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74
This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.
This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.
Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:
1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.
128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K
2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.
1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-2-aneesh.kumar@linux.ibm.com
Previous changes have removed the code which sets bits in
cpu_associativity_changes_mask and thus it is never modifed at
runtime. From this we can reason that numa_update_cpu_topology()
always returns 0 without doing anything. Remove the body of
numa_update_cpu_topology() and remove all code which becomes
unreachable as a result.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-10-nathanl@linux.ibm.com
Make GTSE an MMU feature and enable it by default for radix.
However for guest, conditionally enable it if hypervisor supports
it via OV5 vector. Let prom_init ask for radix GTSE only if the
support exists.
Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703053608.12884-2-bharata@linux.ibm.com