Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
Recently in commit 7241d26e81 ("powerpc/64: properly initialise
the stackprotector canary on SMP.") we fixed a crash with stack
protector on SMP by initialising the stack canary in
cpu_idle_thread_init().
But this can also causes crashes, when a CPU comes back online after
being offline:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: pnv_smp_cpu_kill_self+0x2a0/0x2b0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-rc3-gcc-7.3.1-00168-g4ffe713b7587 #94
Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
panic+0x144/0x328
__stack_chk_fail+0x2c/0x30
pnv_smp_cpu_kill_self+0x2a0/0x2b0
cpu_die+0x48/0x70
arch_cpu_idle_dead+0x20/0x40
do_idle+0x274/0x390
cpu_startup_entry+0x38/0x50
start_secondary+0x5e4/0x600
start_secondary_prolog+0x10/0x14
Looking at the stack we see that the canary value in the stack frame
doesn't match the canary in the task/paca. That is because we have
reinitialised the task/paca value, but then the CPU coming online has
returned into a function using the old canary value. That causes the
comparison to fail.
Instead we can call boot_init_stack_canary() from start_secondary()
which never returns. This is essentially what the generic code does in
cpu_startup_entry() under #ifdef X86, we should make that non-x86
specific in a future patch.
Fixes: 7241d26e81 ("powerpc/64: properly initialise the stackprotector canary on SMP.")
Reported-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
commit b96672dd84 ("powerpc: Machine check interrupt is a non-
maskable interrupt") added a call to nmi_enter() at the beginning of
machine check restart exception handler. Due to that, in_interrupt()
always returns true regardless of the state before entering the
exception, and die() panics even when the system was not already in
interrupt.
This patch calls nmi_exit() before calling die() in order to restore
the interrupt state we had before calling nmi_enter()
Fixes: b96672dd84 ("powerpc: Machine check interrupt is a non-maskable interrupt")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The recent module relocation overflow crash demonstrated that we
have no range checking on REL32 relative relocations. This patch
implements a basic check, the same kernel that previously oopsed
and rebooted now continues with some of these errors when loading
the module:
module_64: x_tables: REL32 527703503449812 out of range!
Possibly other relocations (ADDR32, REL16, TOC16, etc.) should also have
overflow checks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, we expect to be able to reach ftrace_caller() from all
ftrace-enabled functions through a single relative branch. With large
kernel configs, we see functions outside of 32MB of ftrace_caller()
causing ftrace_init() to bail.
In such configurations, gcc/ld emits two types of trampolines for mcount():
1. A long_branch, which has a single branch to mcount() for functions that
are one hop away from mcount():
c0000000019e8544 <00031b56.long_branch._mcount>:
c0000000019e8544: 4a 69 3f ac b c00000000007c4f0 <._mcount>
2. A plt_branch, for functions that are farther away from mcount():
c0000000051f33f8 <0008ba04.plt_branch._mcount>:
c0000000051f33f8: 3d 82 ff a4 addis r12,r2,-92
c0000000051f33fc: e9 8c 04 20 ld r12,1056(r12)
c0000000051f3400: 7d 89 03 a6 mtctr r12
c0000000051f3404: 4e 80 04 20 bctr
We can reuse those trampolines for ftrace if we can have those
trampolines go to ftrace_caller() instead. However, with ABIv2, we
cannot depend on r2 being valid. As such, we use only the long_branch
trampolines by patching those to instead branch to ftrace_caller or
ftrace_regs_caller.
In addition, we add additional trampolines around .text and .init.text
to catch locations that are covered by the plt branches. This allows
ftrace to work with most large kernel configurations.
For now, we always patch the trampolines to go to ftrace_regs_caller,
which is slightly inefficient. This can be optimized further at a later
point.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If CONFIG_PPC_SPLPAR is not selected, steal_time will always
be NUL, so accounting it is pointless
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.
Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:
Before:
00000000 l F .text 000000a8 vtime_delta
00000280 g F .text 0000010c vtime_account_system
0000038c g F .text 00000048 vtime_account_idle
After:
(vtime_delta gets inlined inside the two functions)
000001d8 g F .text 000000a0 vtime_account_system
00000278 g F .text 00000038 vtime_account_idle
In terms of performance, we also get approximatly 7% improvement on
task switch. The following small benchmark app is run with perf stat:
void *thread(void *arg)
{
int i;
for (i = 0; i < atoi((char*)arg); i++)
pthread_yield();
}
int main(int argc, char **argv)
{
pthread_t th1, th2;
pthread_create(&th1, NULL, thread, argv[1]);
pthread_create(&th2, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);
return 0;
}
Before the patch:
Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):
8228.476465 task-clock (msec) # 0.954 CPUs utilized ( +- 0.23% )
200004 context-switches # 0.024 M/sec ( +- 0.00% )
After the patch:
Performance counter stats for 'chrt -f 98 ./sched 100000' (50 runs):
7649.070444 task-clock (msec) # 0.955 CPUs utilized ( +- 0.27% )
200004 context-switches # 0.026 M/sec ( +- 0.00% )
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.
In preparation of the following patch that will remove
CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC32, this patch moves
all scaled cputing accounting logic into dedicated functions.
This patch doesn't change any functionality. It's only code
reorganisation.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
module_frob_arch_sections() is called before the module is moved to its
final location. The function descriptor section addresses we are setting
here are thus invalid. Fix this by processing opd section during
module_finalize()
Fixes: 5633e85b2c ("powerpc64: Add .opd based function descriptor dereference")
Cc: stable@vger.kernel.org # v4.16
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Like all other dma mapping drivers just return an error code instead
of an actual memory buffer. The reason for the overflow buffer was
that at the time swiotlb was invented there was no way to check for
dma mapping errors, but this has long been fixed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In the recent commit 8b78fdb045 ("powerpc/time: Use
clockevents_register_device(), fixing an issue with large
decrementer") we changed the way we initialise the decrementer
clockevent(s).
We no longer initialise the mult & shift values of
decrementer_clockevent itself.
This has the effect of breaking PR KVM, because it uses those values
in kvmppc_emulate_dec(). The symptom is guest kernels spin forever
mid-way through boot.
For now fix it by assigning back to decrementer_clockevent the mult
and shift values.
Fixes: 8b78fdb045 ("powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back when I added -Werror in commit ba55bd7436 ("powerpc: Add
configurable -Werror for arch/powerpc") I did it by adding it to most
of the arch Makefiles.
At the time we excluded math-emu, because apparently it didn't build
cleanly. But that seems to have been fixed somewhere in the interim.
So move the -Werror addition to the top-level of the arch, this saves
us from repeating it in every Makefile and means we won't forget to
add it to any new sub-dirs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
do_exit() already includes a test to panic() is in_interrupt()
This patch removes powerpc one which is redundant.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When creating the boot-time FDT from an actual Open Firmware live
tree, let's generate "phandle" properties for the phandles instead
of the old deprecated "linux,phandle".
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Unsplit warning printf()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
prom_init.c must not modify the kernel image outside
of the .bss.prominit section. Thus make sure that
prom_init.o doesn't have anything in any of these:
.data
.bss
.init.data
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This makes __prombss its own section, and for now store
it in .bss.
This will give us the ability later to store it elsewhere
and/or free it after boot (it's about 8KB).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As they are no longer used past the end of prom_init
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Make the existing initialized definition constant and copy
it to a __prombss copy
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We removed support for running under any OPAL version
earlier than v3 in 2015 (they never saw the light of day
anyway), but we kept some leftovers of this support in
prom_init.c, so let's take it out.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This replaces all occurrences of __initdata for uninitialized
data with a new __prombss
Currently __promdata is defined to be __initdata but we'll
eventually change that.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch implements support for discovering storage class memory
devices at boot and for handling hotplug of new regions via RTAS
hotplug events.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Fix CONFIG_MEMORY_HOTPLUG=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When printing the machine check cause, the cause appears on the
following line due to bad use of printk without \n:
[ 33.663993] Machine check in kernel mode.
[ 33.664011] Caused by (from SRR1=9032):
[ 33.664036] Data access error at address c90c8000
This patch fixes it by using pr_cont() for the second part:
[ 133.258131] Machine check in kernel mode.
[ 133.258146] Caused by (from SRR1=9032): Data access error at address c90c8000
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
slb_flush_and_rebolt() is misleading, it is called in virtual mode, so
it can not possibly change the stack, so it should not be touching the
shadow area. And since vmalloc is no longer bolted, it should not
change any bolted mappings at all.
Change the name to slb_flush_and_restore_bolted(), and have it just
load the kernel stack from what's currently in the shadow SLB area.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory must not be accessed when handling kernel
space SLB misses, so care should be taken there. However user SLB
misses can access any kernel memory, which can be used to move some
fields out of the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to
bad address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Disallow tracing for all of slb.c for now.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPR is the odd register out when it comes to interrupt handling, it is
saved in current->thread.ppr while all others are saved on the stack.
The difficulty with this is that accessing thread.ppr can cause a SLB
fault, but the SLB fault handler implementation in C change had
assumed the normal exception entry handlers would not cause an SLB
fault.
Fix this by allocating room in the interrupt stack to save PPR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we've split the user & kernel versions of pt_regs we need to
be more careful in the ptrace code.
For now we've ensured the location of the fields in both structs is
the same, so most of the ptrace code doesn't need updating.
But there are a few places where we use sizeof(pt_regs), and these
will be wrong as soon as we increase the size of the kernel structure.
So flip them all to use sizeof(user_pt_regs).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We use a shared definition for struct pt_regs in uapi/asm/ptrace.h.
That means the layout of the structure is ABI, ie. we can't change it.
That would be fine if it was only used to describe the user-visible
register state of a process, but it's also the struct we use in the
kernel to describe the registers saved in an interrupt frame.
We'd like more flexibility in the content (and possibly layout) of the
kernel version of the struct, but currently that's not possible.
So split the definition into a user-visible definition which remains
unchanged, and a kernel internal one.
At the moment they're still identical, and we check that at build
time. That's because we have code (in ptrace etc.) that assumes that
they are the same. We will fix that code in future patches, and then
we can break the strict symmetry between the two structs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
_PAGE_PRIVILEGED corresponds to the SH bit which doesn't protect
against user access but only disables ASID verification on kernel
accesses. User access is controlled with _PMD_USER flag.
Name it _PAGE_SH instead of _PAGE_PRIVILEGED
_PAGE_HUGE corresponds to the SPS bit which doesn't really tells
that's it is a huge page but only that it is not a 4k page.
Name it _PAGE_SPS instead of _PAGE_HUGE
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In order to avoid multiple conversions, handover directly a
pgprot_t to map_kernel_page() as already done for radix.
Do the same for __ioremap_caller() and __ioremap_at().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Set PAGE_KERNEL directly in the caller and do not rely on a
hack adding PAGE_KERNEL flags when _PAGE_PRESENT is not set.
As already done for PPC64, use pgprot_cache() helpers instead of
_PAGE_XXX flags in PPC32 ioremap() derived functions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In many places, ioremap_prot() and __ioremap() can be replaced with
higher level functions like ioremap(), ioremap_coherent(),
ioremap_cache(), ioremap_wc() ...
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Live Partition Migrations require all the present CPUs to execute the
H_JOIN call, and hence rtas_ibm_suspend_me() onlines any offline CPUs
before initiating the migration for this purpose.
The commit 85a88cabad
("powerpc/pseries: Disable CPU hotplug across migrations")
disables any CPU-hotplug operations once all the offline CPUs are
brought online to prevent any further state change. Once the
CPU-Hotplug operation is disabled, the code assumes that all the CPUs
are online.
However, there is a minor window in rtas_ibm_suspend_me() between
onlining the offline CPUs and disabling CPU-Hotplug when a concurrent
CPU-offline operations initiated by the userspace can succeed thereby
nullifying the the aformentioned assumption. In this unlikely case
these offlined CPUs will not call H_JOIN, resulting in a system hang.
Fix this by verifying that all the present CPUs are actually online
after CPU-Hotplug has been disabled, failing which we restore the
state of the offline CPUs in rtas_ibm_suspend_me() and return an
-EBUSY.
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently on POWER9 SMT8 cores systems, in sysfs, we report the
shared_cache_map for L1 caches (both data and instruction) to be the
cpu-ids of the threads in SMT8 cores. This is incorrect since on
POWER9 SMT8 cores there are two groups of threads, each of which
shares its own L1 cache.
This patch addresses this by reporting the shared_cpu_map correctly in
sysfs for L1 caches.
Before the patch
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000ff
/sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000ff
After the patch
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 00000055
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 00000055
/sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000aa
/sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000aa
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 SMT8 cores consist of two groups of threads, where threads in
each group shares L1-cache. The scheduler is not aware of this
distinction as the current sched-domain hierarchy has all the threads
of the core defined at the SMT domain.
SMT [Thread siblings of the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]
Due to this, we can observe run-to-run variance when we run a
multi-threaded benchmark bound to a single core based on how the
scheduler spreads the software threads across the two groups in the
core.
We fix this in this patch by defining each group of threads which
share L1-cache to be the SMT level. The group of threads in the SMT8
core is defined to be the CACHE level. The sched-domain hierarchy
after this patch will be :
SMT [Thread siblings in the core that share L1 cache]
CACHE [Thread siblings that are in the SMT8 core]
DIE [CPUs in the same die]
NUMA [All the CPUs in the system]
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On IBM POWER9, the device tree exposes a property array identifed by
"ibm,thread-groups" which will indicate which groups of threads share
a particular set of resources.
As of today we only have one form of grouping identifying the group of
threads in the core that share the L1 cache, translation cache and
instruction data flow.
This patch adds helper functions to parse the contents of
"ibm,thread-groups" and populate a per-cpu variable to cache
information about siblings of each CPU that share the L1, traslation
cache and instruction data-flow.
It also defines a new global variable named "has_big_cores" which
indicates if the cores on this configuration have multiple groups of
threads that share L1 cache.
For each online CPU, it maintains a cpu_smallcore_mask, which
indicates the online siblings which share the L1-cache with it.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 6c1719942e ("powerpc/of: Remove useless register save/restore
when calling OF back") removed the saving of srr0 and srr1 when calling
into OpenFirmware. Commit e31aa453bb ("powerpc: Use LOAD_REG_IMMEDIATE
only for constants on 64-bit") did the same for rtas.
This means we don't need to save the extra stack space and can use
the common SWITCH_FRAME_SIZE.
There were already no users of _SRR0 and _SRR1 so we can remove them
too.
Link: https://github.com/linuxppc/linux/issues/83
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc mobility code may receive RTAS requests to perform PRRN
(Platform Resource Reassignment Notification) topology changes at any
time, including during LPAR migration operations.
In some configurations where the affinity of CPUs or memory is being
changed on that platform, the PRRN requests may apply or refer to
outdated information prior to the complete update of the device-tree.
This patch changes the duration for which topology updates are
suppressed during LPAR migrations from just the rtas_ibm_suspend_me()
/ 'ibm,suspend-me' call(s) to cover the entire migration_store()
operation to allow all changes to the device-tree to be applied prior
to accepting and applying any PRRN requests.
For tracking purposes, pr_info notices are added to the functions
start_topology_update() and stop_topology_update() of 'numa.c'.
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than mixing "if (state)" blocks and gotos, convert entirely to
"if (state)" blocks to make the state machine behaviour clearer.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The wait_state member of eeh_ops does not need to be platform
dependent; it's just logic around eeh_ops.get_state(). Therefore,
merge the two (slightly different!) platform versions into a new
function, eeh_wait_state() and remove the eeh_ops member.
While doing this, also correct:
* The wait logic, so that it never waits longer than max_wait.
* The wait logic, so that it never waits less than
EEH_STATE_MIN_WAIT_TIME.
* One call site where the result is treated like a bit field before
it's checked for negative error values.
* In pseries_eeh_get_state(), rename the "state" parameter to "delay"
because that's what it is.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, eeh_pe_state_mark() marks a PE (and it's children) with a
state and then performs additional processing if that state included
EEH_PE_ISOLATED.
The state parameter is always a constant at the call site, so
rearrange eeh_pe_state_mark() into two functions and just call the
appropriate one at each site.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function eeh_pe_state_mark_with_cfg() just performs the work of
eeh_pe_state_mark() and then, conditionally, the work of
eeh_pe_state_clear(). However it is only ever called with a constant
state such that the condition is always true, so replace it by direct
calls.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>