We have several 44x machine check handlers defined in traps.c. It would
be preferable if they were split out with the platforms that use them.
Do that.
In the process, drop get_mc_reason() and instead just open code the
lookup of reason from regs->dsisr. This avoids a pointless layer of
abstraction.
We know to use regs->dsisr because 44x enables BOOKE which enables
PPC_ADV_DEBUG_REGS, and FSL_BOOKE is not enabled on 44x builds.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we build the 47x cputable entries even when CONFIG_PPC_47x is
disabled. That means a kernel built without CONFIG_PPC_47x will claim to
support a 47x CPU and start booting, only to break somewhere later
because it doesn't have 47x support compiled in.
So guard the 47x cputable entries with CONFIG_PPC_47x. Note that this is
inside the #ifdef CONFIG_44x section, because 47x depends on 44x.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds an irq counter for the watchdog soft-NMI. This interrupt
only fires when interrupts are soft-disabled, so it will not
increment much even when the watchdog is running. However it's
useful for debugging and sanity checking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The powerpc kernel/watchdog.o should be built when HARDLOCKUP_DETECTOR
and HAVE_HARDLOCKUP_DETECTOR_ARCH are both selected. If only the former
is selected, then the generic perf watchdog has been selected.
To simplify this check, introduce a new Kconfig symbol PPC_WATCHDOG that
depends on both. This Kconfig option means the powerpc specific
watchdog is enabled.
Without this patch, Book3E will attempt to build the powerpc watchdog.
Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On 64-bit Book3s, when we're in HV mode, we have already counted the
machine check exception in machine_check_early().
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use IS_ENABLED() rather than an #ifdef]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals, for example:
arch/powerpc/kernel/entry_64.S:535: Warning: invalid register expression
In practice these are almost all uses of r0 that should just be a
literal 0.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Mention r0 is almost always the culprit, fold in purgatory change]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When CPUs start and stop the watchdog, they manipulate shared data
that is normally protected by the lock. Other CPUs can be running
concurrently at this time, so it's a good idea to use locking here
to be on the safe side.
Remove the barrier which is undocumented and didn't do anything.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the SMP detector finds other CPUs stuck, it iterates over
them and marks them as stuck. This pulls them out of the pending
mask and allows the detector to continue with remaining good
CPUs (if nmi_watchdog=panic is not enabled).
The code to dothat was buggy because when setting a CPU stuck,
if the pending mask became empty, it resets it to keep the
watchdog running. However the iterator will continue to run
over the new pending mask and mark remaining good CPUs sas stuck.
Fix this by doing it with cpumask bitwise operations.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the watchdog decides to panic, it takes the lock and double
checks everything (to avoid races with the CPU being unstuck or
panic()ed by something else).
The exit label was misplaced and would result in all-CPUs backtrace
and watchdog panic even in the case that the condition was found to be
resolved.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some code can go into a tight loop calling touch_nmi_watchdog (e.g.,
stop_machine CPU hotplug code). This can cause contention on watchdog
locks particularly if all CPUs with watchdog enabled are spinning in
the loops.
Avoid this storm of activity by running the watchdog timer callback
from this path if we have exceeded the timer period since it was last
run.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Hard-disable interrupts before taking the lock, which prevents
soft-NMI re-entrancy and therefore can prevent deadlocks.
- Use raw_ variants of local_irq_disable to avoid irq debugging.
- When the lock is contended, spin at low SMT priority, using
loads only, and with interrupts enabled (where possible).
Some stalls have been noticed at high loads that go away with improved
locking. There should not be so much locking contention in the first
place (which is addressed in a subsequent patch), but locking should
still be improved.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the NMI IPI lock is contended, spin at low SMT priority, using
loads only, and with interrupts enabled (where possible). This
improves behaviour under high contention (e.g., a system crash when
a number of CPUs are trying to enter the debugger).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit d300627c6a ("powerpc/6xx: Handle DABR match before calling
do_page_fault") breaks non 6xx platforms.
Failed to execute /init (error -14)
Starting init: /bin/sh exists but couldn't execute it (error -14)
Kernel panic - not syncing: No working init found. Try passing init= ...
CPU: 0 PID: 1 Comm: init Not tainted 4.13.0-rc3-s3k-dev-00143-g7aa62e972a56 #56
Call Trace:
panic+0x108/0x250 (unreliable)
rootfs_mount+0x0/0x58
ret_from_kernel_thread+0x5c/0x64
Rebooting in 180 seconds..
This is because in handle_page_fault(), the call to do_page_fault() has been
mistakenly enclosed inside an #ifdef CONFIG_6xx
Fixes: d300627c6a ("powerpc/6xx: Handle DABR match before calling do_page_fault")
Brown-paper-bag-to-be-worn-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the decrementer wraps again and de-asserts the decrementer
exception while hard-disabled, __check_irq_replay() has a test to
notice the wrap when interrupts are re-enabled.
The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS
flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked
because the decrementer interrupt was always the first one checked
after clearing the hard disable flag, but HMI check was moved ahead of
that, which introduced this bug.
This can cause a missed decrementer interrupt if we soft-disable
interrupts then take an HMI which is recorded in irq_happened, then
hard-disable interrupts for > 4s to wrap the decrementer.
Fixes: e0e0d6b739 ("powerpc/64: Replay hypervisor maintenance interrupt first")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD2 PMU can stop after a state-loss idle in some conditions.
A solution is to set then clear MMCRA[60] after wake from state-loss
idle. MMCRA[60] is a non-architected bit, see the user manual for
details.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This uses the newly defined constants for this rather than open-coded
numbers. There is a side effect on 64-bit which is to pass through
some of the new P9 bits which we didn't before.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We test a number of bits from DSISR/SRR1 before deciding
to call hash_page(). If any of these is set, we go directly
to do_page_fault() as the bit indicate a fault that needs
to be handled there (no hashing needed).
This updates the current open-coded masks to use the new
DSISR definitions.
This *does* change the masks actually used in two ways:
- We used to test various bits that were defined as "always 0"
in the architecture and could be repurposed for something
else. From now on, we just ignore such bits.
- We were missing some new bits defined on P9
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On legacy 6xx 32-bit procesors, we checked for the DABR match bit
in DSISR from do_page_fault(), in the middle of a pile of ifdef's
because all other CPU types do it in assembly prior to calling
do_page_fault. Fix that.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Add #ifdef CONFIG_6xx]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
By filtering the relevant SRR1 bits in the assembly rather than
in do_page_fault() itself, we avoid a conditional branch (since we
already come from different path for data and instruction faults).
This will allow more simplifications later
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Replace the __this_cpu_read() with raw_cpu_read() in
iommu_range_alloc(). Otherwise we get a warning about using
__this_cpu_read() in preemptible code:
BUG: using __this_cpu_read() in preemptible
caller is iommu_range_alloc+0xa8/0x3d0
Preemption doesn't need to be disabled since according to the comment
any CPU can safely use any IOMMU pool.
Signed-off-by: Victor Aoqui <victora@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The stop4 idle state on POWER9 is a deep idle state which loses
hypervisor resources, but whose latency is low enough that it can be
exposed via cpuidle.
Until now, the deep idle states which lose hypervisor resources (eg:
winkle) were only exposed via CPU-Hotplug. Hence currently on wakeup
from such states, barring a few SPRs which need to be restored to
their older value, rest of the SPRS are reinitialized to their values
corresponding to that at boot time.
When stop4 is used in the context of cpuidle, we want these additional
SPRs to be restored to their older value, to ensure that the context
on the CPU coming back from idle is same as it was before going idle.
In this patch, we define a SPR save area in PACA (since we have used
up the volatile register space in the stack) and on POWER9, we restore
SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
SPRN_MMCR2 to the values they had before entering stop.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The watchdog soft-NMI exception stack setup loads a stack pointer
twice, which is an obvious error. It ends up using the system reset
interrupt (true-NMI) stack, which is also a bug because the watchdog
could be preempted by a system reset interrupt that overwrites the
NMI stack.
Change the soft-NMI to use the "emergency stack". The current kernel
stack is not used, because of the longer-term goal to prevent
asynchronous stack access using soft-disable.
Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The fixes branch is based off a random pre-rc1 commit, because we had
some fixes that needed to go in before rc1 was released.
However we now need to fix some code that went in after that point, but
before rc1, so merge rc1 to get that code into fixes so we can fix it!
In smp_cpus_done() we need to call smp_ops->setup_cpu() for the boot
CPU, which means it has to run *on* the boot CPU.
In the past we ensured it ran on the boot CPU by changing the CPU
affinity mask of current directly. That was removed in commit
6d11b87d55 ("powerpc/smp: Replace open coded task affinity logic"),
and replaced with a work queue call.
Unfortunately using a work queue leads to a lockdep warning, now that
the CPU hotplug lock is a regular semaphore:
======================================================
WARNING: possible circular locking dependency detected
...
kworker/0:1/971 is trying to acquire lock:
(cpu_hotplug_lock.rw_sem){++++++}, at: [<c000000000100974>] apply_workqueue_attrs+0x34/0xa0
but task is already holding lock:
((&wfc.work)){+.+.+.}, at: [<c0000000000fdb2c>] process_one_work+0x25c/0x800
...
CPU0 CPU1
---- ----
lock((&wfc.work));
lock(cpu_hotplug_lock.rw_sem);
lock((&wfc.work));
lock(cpu_hotplug_lock.rw_sem);
Although the deadlock can't happen in practice, because
smp_cpus_done() only runs in early boot before CPU hotplug is allowed,
lockdep can't tell that.
Luckily in commit 8fb12156b8 ("init: Pin init task to the boot CPU,
initially") tglx changed the generic code to pin init to the boot CPU
to begin with. The unpinning of init from the boot CPU happens in
sched_init_smp(), which is called after smp_cpus_done().
So smp_cpus_done() is always called on the boot CPU, which means we
don't need the work queue call at all - and the lockdep warning goes
away.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Currently flush_tmregs_to_thread() does not save the TM SPRs (TFHAR,
TFIAR, TEXASR) to the thread struct, unless the process is currently
inside a suspended transaction.
If the process is core dumping, and the TM SPRs have changed since the
last time the process was context switched, then we will save stale
values of the TM SPRs to the core dump.
Fix it by saving the live register state to the thread struct in that
case.
Fixes: 08e1c01d6a ("powerpc/ptrace: Enable support for TM SPR state")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
struct siginfo is a union and the kernel since 2.4 has been hiding a union
tag in the high 16bits of si_code using the values:
__SI_KILL
__SI_TIMER
__SI_POLL
__SI_FAULT
__SI_CHLD
__SI_RT
__SI_MESGQ
__SI_SYS
While this looks plausible on the surface, in practice this situation has
not worked well.
- Injected positive signals are not copied to user space properly
unless they have these magic high bits set.
- Injected positive signals are not reported properly by signalfd
unless they have these magic high bits set.
- These kernel internal values leaked to userspace via ptrace_peek_siginfo
- It was possible to inject these kernel internal values and cause the
the kernel to misbehave.
- Kernel developers got confused and expected these kernel internal values
in userspace in kernel self tests.
- Kernel developers got confused and set si_code to __SI_FAULT which
is SI_USER in userspace which causes userspace to think an ordinary user
sent the signal and that it was not kernel generated.
- The values make it impossible to reorganize the code to transform
siginfo_copy_to_user into a plain copy_to_user. As si_code must
be massaged before being passed to userspace.
So remove these kernel internal si codes and make the kernel code simpler
and more maintainable.
To replace these kernel internal magic si_codes introduce the helper
function siginfo_layout, that takes a signal number and an si_code and
computes which union member of siginfo is being used. Have
siginfo_layout return an enumeration so that gcc will have enough
information to warn if a switch statement does not handle all of union
members.
A couple of architectures have a messed up ABI that defines signal
specific duplications of SI_USER which causes more special cases in
siginfo_layout than I would like. The good news is only problem
architectures pay the cost.
Update all of the code that used the previous magic __SI_ values to
use the new SIL_ values and to call siginfo_layout to get those
values. Escept where not all of the cases are handled remove the
defaults in the switch statements so that if a new case is missed in
the future the lack will show up at compile time.
Modify the code that copies siginfo si_code to userspace to just copy
the value and not cast si_code to a short first. The high bits are no
longer used to hold a magic union member.
Fixup the siginfo header files to stop including the __SI_ values in
their constants and for the headers that were missing it to properly
update the number of si_codes for each signal type.
The fixes to copy_siginfo_from_user32 implementations has the
interesting property that several of them perviously should never have
worked as the __SI_ values they depended up where kernel internal.
With that dependency gone those implementations should work much
better.
The idea of not passing the __SI_ values out to userspace and then
not reinserting them has been tested with criu and criu worked without
changes.
Ref: 2.4.0-test1
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull powerpc fixes from Michael Ellerman:
"A handful of fixes, mostly for new code:
- some reworking of the new STRICT_KERNEL_RWX support to make sure we
also remove executable permission from __init memory before it's
freed.
- a fix to some recent optimisations to the hypercall entry where we
were clobbering r12, this was breaking nested guests (PR KVM).
- a fix for the recent patch to opal_configure_cores(). This could
break booting on bare metal Power8 boxes if the kernel was built
without CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.
- .. and finally a workaround for spurious PMU interrupts on Power9
DD2.
Thanks to: Nicholas Piggin, Anton Blanchard, Balbir Singh"
* tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y
powerpc/mm/hash: Refactor hash__mark_rodata_ro()
powerpc/mm/radix: Refactor radix__mark_rodata_ro()
powerpc/64s: Fix hypercall entry clobbering r12 input
powerpc/perf: Avoid spurious PMU interrupts after idle
powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()
A previous optimisation incorrectly assumed the PAPR hcall does
not use r12, and clobbers it upon entry. In fact it is used as
an input. This can result in KVM guests crashing (observed with
PR KVM).
Instead of using r12 to save r13, tihs patch saves r13 in ctr.
This is more costly, but not as slow as using the SPRG.
Fixes: acd7d8cef0 ("powerpc/64s: Optimize hypercall/syscall entry")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD2 can see spurious PMU interrupts after state-loss idle in
some conditions.
A solution is to save and reload MMCR0 over state-loss idle.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull powerpc fixes from Michael Ellerman:
"Nothing that really stands out, just a bunch of fixes that have come
in in the last couple of weeks.
None of these are actually fixes for code that is new in 4.13. It's
roughly half older bugs, with fixes going to stable, and half
fixes/updates for Power9.
Thanks to: Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin
Herrenschmidt, Madhavan Srinivasan, Michael Neuling, Nicholas Piggin,
Oliver O'Halloran"
* tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64: Fix atomic64_inc_not_zero() to return an int
powerpc: Fix emulation of mfocrf in emulate_step()
powerpc: Fix emulation of mcrf in emulate_step()
powerpc/perf: Add POWER9 alternate PM_RUN_CYC and PM_RUN_INST_CMPL events
powerpc/perf: Fix SDAR_MODE value for continous sampling on Power9
powerpc/asm: Mark cr0 as clobbered in mftb()
powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9
powerpc/mm/radix: Synchronize updates to the process table
powerpc/mm/radix: Properly clear process table entry
powerpc/powernv: Tell OPAL about our MMU mode on POWER9
powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR
Implement an arch-speicfic watchdog rather than use the perf-based
hardlockup detector.
The new watchdog takes the soft-NMI directly, rather than going through
perf. Perf interrupts are to be made maskable in future, so that would
prevent the perf detector from working in those regions.
Additionally, implement a SMP based detector where all CPUs watch one
another by pinging a shared cpumask. This is because powerpc Book3S
does not have a true periodic local NMI, but some platforms do implement
a true NMI IPI.
If a CPU is stuck with interrupts hard disabled, the soft-NMI watchdog
does not work, but the SMP watchdog will. Even on platforms without a
true NMI IPI to get a good trace from the stuck CPU, other CPUs will
notice the lockup sufficiently to report it and panic.
[npiggin@gmail.com: honor watchdog disable at boot/hotplug]
Link: http://lkml.kernel.org/r/20170621001346.5bb337c9@roar.ozlabs.ibm.com
[npiggin@gmail.com: fix false positive warning at CPU unplug]
Link: http://lkml.kernel.org/r/20170630080740.20766-1-npiggin@gmail.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170616065715.18390-6-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.
LOCKUP_DETECTOR implies the general boot, sysctl, and programming
interfaces for the lockup detectors.
An architecture that wants to use a hard lockup detector must define
HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.
Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
does not implement the LOCKUP_DETECTOR interfaces.
sparc is unusual in that it has started to implement some of the
interfaces, but not fully yet. It should probably be converted to a full
HAVE_HARDLOCKUP_DETECTOR_ARCH.
[npiggin@gmail.com: fix]
Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Tested-by: Babu Moger <babu.moger@oracle.com> [sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two cases outside the normal address space management
where a CPU's local TLB is to be flushed:
1. Host boot; in case something has left stale entries in the
TLB (e.g., kexec).
2. Machine check; to clean corrupted TLB entries.
CPU state restore from deep idle states also flushes the TLB.
However this seems to be a side effect of reusing the boot code to set
CPU state, rather than a requirement itself.
The current flushing has a number of problems with ISA v3.0B:
- The current radix mode of the MMU is not taken into account. tlbiel
is undefined if the R field does not match the current radix mode.
- ISA v3.0B hash must flush the partition and process table caches.
- ISA v3.0B radix must flush partition and process scoped translations,
partition and process table caches, and also the page walk cache.
Add POWER9 cases to handle these, with radix vs hash determined by the
host MMU mode.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch fixes a crash seen while doing a kexec from radix mode to
hash mode. Key 0 is special in hash and used in the RPN by default, we
set the key values to 0 today. In radix mode key 0 is used to control
supervisor<->user access. In hash key 0 is used by default, so the
first instruction after the switch causes a crash on kexec.
Commit 3b10d0095a ("powerpc/mm/radix: Prevent kernel execution of
user space") introduced the setting of IAMR and AMOR values to prevent
execution of user mode instructions from supervisor mode. We need to
clean up these SPR's on kexec.
Fixes: 3b10d0095a ("powerpc/mm/radix: Prevent kernel execution of user space")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull perf fixes from Thomas Gleixner:
"A couple of fixes for perf and kprobes:
- Add he missing exclude_kernel attribute for the precise_ip level so
!CAP_SYS_ADMIN users get the proper results.
- Warn instead of failing completely when perf has no unwind support
for a particular architectiure built in.
- Ensure that jprobes are at function entry and not at some random
place"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Ensure that jprobe probepoints are at function entry
kprobes: Simplify register_jprobes()
kprobes: Rename [arch_]function_offset_within_entry() to [arch_]kprobe_on_func_entry()
perf unwind: Do not fail due to missing unwind support
perf evsel: Set attr.exclude_kernel when probing max attr.precise_ip
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
- Platform support for FSP2 (476fpe) board
- Enable ZONE_DEVICE on 64-bit server CPUs.
- Generic & powerpc spin loop primitives to optimise busy waiting
- Convert VDSO update function to use new update_vsyscall() interface
- Optimisations to hypercall/syscall/context-switch paths
- Improvements to the CPU idle code on Power8 and Power9.
As well as many other fixes and improvements.
Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
Bauermann, Yang Li"
* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
powerpc/mm/hash: Implement mark_rodata_ro() for hash
powerpc/vmlinux.lds: Align __init_begin to 16M
powerpc/lib/code-patching: Use alternate map for patch_instruction()
powerpc/xmon: Add patch_instruction() support for xmon
powerpc/kprobes/optprobes: Use patch_instruction()
powerpc/kprobes: Move kprobes over to patch_instruction()
powerpc/mm/radix: Fix execute permissions for interrupt_vectors
powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
powerpc/64s: Blacklist rtas entry/exit from kprobes
powerpc/64s: Blacklist functions invoked on a trap
powerpc/64s: Un-blacklist system_call() from kprobes
powerpc/64s: Move system_call() symbol to just after setting MSR_EE
powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
powerpc/64s: Convert .L__replay_interrupt_return to a local label
powerpc64/elfv1: Only dereference function descriptor for non-text symbols
cxl: Export library to support IBM XSL
powerpc/dts: Use #include "..." to include local DT
powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
...
Pull dma-mapping infrastructure from Christoph Hellwig:
"This is the first pull request for the new dma-mapping subsystem
In this new subsystem we'll try to properly maintain all the generic
code related to dma-mapping, and will further consolidate arch code
into common helpers.
This pull request contains:
- removal of the DMA_ERROR_CODE macro, replacing it with calls to
->mapping_error so that the dma_map_ops instances are more self
contained and can be shared across architectures (me)
- removal of the ->set_dma_mask method, which duplicates the
->dma_capable one in terms of functionality, but requires more
duplicate code.
- various updates for the coherent dma pool and related arm code
(Vladimir)
- various smaller cleanups (me)"
* tag 'dma-mapping-4.13' of git://git.infradead.org/users/hch/dma-mapping: (56 commits)
ARM: dma-mapping: Remove traces of NOMMU code
ARM: NOMMU: Set ARM_DMA_MEM_BUFFERABLE for M-class cpus
ARM: NOMMU: Introduce dma operations for noMMU
drivers: dma-mapping: allow dma_common_mmap() for NOMMU
drivers: dma-coherent: Introduce default DMA pool
drivers: dma-coherent: Account dma_pfn_offset when used with device tree
dma: Take into account dma_pfn_offset
dma-mapping: replace dmam_alloc_noncoherent with dmam_alloc_attrs
dma-mapping: remove dmam_free_noncoherent
crypto: qat - avoid an uninitialized variable warning
au1100fb: remove a bogus dma_free_nonconsistent call
MAINTAINERS: add entry for dma mapping helpers
powerpc: merge __dma_set_mask into dma_set_mask
dma-mapping: remove the set_dma_mask method
powerpc/cell: use the dma_supported method for ops switching
powerpc/cell: clean up fixed mapping dma_ops initialization
tile: remove dma_supported and mapping_error methods
xen-swiotlb: remove xen_swiotlb_set_dma_mask
arm: implement ->dma_supported instead of ->set_dma_mask
mips/loongson64: implement ->dma_supported instead of ->set_dma_mask
...
Pull KVM updates from Paolo Bonzini:
"PPC:
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts pending.
ARM:
- VCPU request overhaul
- allow timer and PMU to have their interrupt number selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups
s390:
- initial machine check forwarding
- migration support for the CMMA page hinting information
- cleanups and fixes
x86:
- nested VMX bugfixes and improvements
- more reliable NMI window detection on AMD
- APIC timer optimizations
Generic:
- VCPU request overhaul + documentation of common code patterns
- kvm_stat improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
Update my email address
kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
kvm: x86: mmu: allow A/D bits to be disabled in an mmu
x86: kvm: mmu: make spte mmio mask more explicit
x86: kvm: mmu: dead code thanks to access tracking
KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
KVM: x86: remove ignored type attribute
KVM: LAPIC: Fix lapic timer injection delay
KVM: lapic: reorganize restart_apic_timer
KVM: lapic: reorganize start_hv_timer
kvm: nVMX: Check memory operand to INVVPID
KVM: s390: Inject machine check into the nested guest
KVM: s390: Inject machine check into the guest
tools/kvm_stat: add new interactive command 'b'
tools/kvm_stat: add new command line switch '-i'
tools/kvm_stat: fix error on interactive command 'g'
KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
...
Pull pstore updates from Kees Cook:
"Various fixes and tweaks for the pstore subsystem.
Highlights:
- use memdup_user() instead of open-coded copies (Geliang Tang)
- fix record memory leak during initialization (Douglas Anderson)
- avoid confused compressed record warning (Ankit Kumar)
- prepopulate record timestamp and remove redundant logic from
backends"
* tag 'pstore-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
powerpc/nvram: use memdup_user
pstore: use memdup_user
pstore: Fix format string to use %u for record id
pstore: Populate pstore record->time field
pstore: Create common record initializer
efi-pstore: Refactor erase routine
pstore: Avoid potential infinite loop
pstore: Fix leaked pstore_record in pstore_get_backend_records()
pstore: Don't warn if data is uncompressed and type is not PSTORE_TYPE_DMESG
Pull SMP hotplug updates from Thomas Gleixner:
"This update is primarily a cleanup of the CPU hotplug locking code.
The hotplug locking mechanism is an open coded RWSEM, which allows
recursive locking. The main problem with that is the recursive nature
as it evades the full lockdep coverage and hides potential deadlocks.
The rework replaces the open coded RWSEM with a percpu RWSEM and
establishes full lockdep coverage that way.
The bulk of the changes fix up recursive locking issues and address
the now fully reported potential deadlocks all over the place. Some of
these deadlocks have been observed in the RT tree, but on mainline the
probability was low enough to hide them away."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
cpu/hotplug: Constify attribute_group structures
powerpc: Only obtain cpu_hotplug_lock if called by rtasd
ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
cpu/hotplug: Remove unused check_for_tasks() function
perf/core: Don't release cred_guard_mutex if not taken
cpuhotplug: Link lock stacks for hotplug callbacks
acpi/processor: Prevent cpu hotplug deadlock
sched: Provide is_percpu_thread() helper
cpu/hotplug: Convert hotplug locking to percpu rwsem
s390: Prevent hotplug rwsem recursion
arm: Prevent hotplug rwsem recursion
arm64: Prevent cpu hotplug rwsem recursion
kprobes: Cure hotplug lock ordering issues
jump_label: Reorder hotplug lock and jump_label_lock
perf/tracing/cpuhotplug: Fix locking order
ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
PCI: Replace the racy recursion prevention
PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
x86/perf: Drop EXPORT of perf_check_microcode
...
For CONFIG_STRICT_KERNEL_RWX align __init_begin to 16M. We use 16M
since its the larger of 2M on radix and 16M on hash for our linear
mapping. The plan is to have .text, .rodata and everything upto
__init_begin marked as RX. Note we still have executable read only
data. We could further align rodata to another 16M boundary. I've used
keeping text plus rodata as read-only-executable as a trade-off to
doing read-only-executable for text and read-only for rodata.
We don't use multi PT_LOAD in PHDRS because we are not sure if all
bootloaders support them. This patch keeps PHDRS in vmlinux.lds.S as
the same they are with just one PT_LOAD for all of the kernel marked
as RWX (7).
mpe: What this means is the added alignment bloats the resulting
binary on disk, a powernv kernel goes from 17M to 22M.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So that we can implement STRICT_RWX, use patch_instruction() in
optprobes.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch_arm/disarm_probe() use direct assignment for copying
instructions, replace them with patch_instruction(). We don't need to
call flush_icache_range() because patch_instruction() does it for us.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We can't take traps with relocation off, so blacklist enter_rtas() and
rtas_return_loc(). However, instead of blacklisting all of enter_rtas(),
introduce a new symbol __enter_rtas from where on we can't take a trap
and blacklist that.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Blacklist all functions involved while handling a trap. We:
- convert some of the symbols into private symbols, and
- blacklist most functions involved while handling a trap.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>