Commit Graph

20073 Commits

Author SHA1 Message Date
Paul Mackerras
fd0944baad KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits.  This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
9a94d3ee2d KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format.  It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
32eb150aee KVM: PPC: Book3S HV: Handle hypervisor instruction faults better
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables.  However, there are other causes for hypervisor
instruction page faults, such as not being to update a reference
(R) or change (C) bit.  The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.

In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
95a6432ce9 KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.

The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered.  On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE.  At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.

The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
53655ddd77 KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.

Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
7854f7545b KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions.  This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved.  The
condition register and DSCR are now always preserved.

With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero.  So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.

Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9.  This enables the code to be simplified
quite a bit.

_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
df709a296e KVM: PPC: Book3S HV: Simplify real-mode interrupt handling
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest.  With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.

Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
41f4e631da KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable functions
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path.  The calling
convention is made compatible with C.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
f7035ce9f1 KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code
This is based on a patch by Suraj Jitindar Singh.

This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c.  This is to
make future maintenance and modification easier.  The algorithm
expressed in the C code is almost identical to the previous
algorithm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
966eba9316 KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap.  This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede.  Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
d24ea8a733 KVM: PPC: Book3S: Simplify external interrupt handling
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller).  The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.

In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller.  That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit.  This
patch does that.

With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt.  In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras
e7b17d5047 powerpc: Turn off CPU_FTR_P9_TM_HV_ASSIST in non-hypervisor mode
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode.  Nested hypervisors can just use the TM
facilities as architected.  Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.

Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when when CPU_FTR_HVMODE is set.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy
a3ac077b75 KVM: PPC: Remove redundand permission bits removal
The kvmppc_gpa_to_ua() helper itself takes care of the permission
bits in the TCE and yet every single caller removes them.

This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
(which are GPAs + TCE permission bits) to make the callers simpler.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy
2691f0ff3d KVM: PPC: Propagate errors to the guest when failed instead of ignoring
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update
the hardware tables, we print a warning once, clear the entry and
continue. This is so as at the time the assumption was that if
a VFIO device is hotplugged into the guest, and the userspace replays
virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails,
then there is nothing useful we can do about it.

However the assumption is not valid as these handlers are not called for
TCE replay (VFIO ioctl interface is used for that) and these handlers
are for new TCEs.

This returns an error to the guest if there is a request which cannot be
processed. By now the only possible failure must be H_TOO_HARD.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy
42de7b9e21 KVM: PPC: Validate TCEs against preregistered memory page sizes
The userspace can request an arbitrary supported page size for a DMA
window and this works fine as long as the mapped memory is backed with
the pages of the same or bigger size; if this is not the case,
mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
dangerously incorrect TCEs.

However since it is quite easy to misconfigure the KVM and we do not do
reverts to all changes made to TCE tables if an error happens in a middle,
we better do the acceptable page size validation before we even touch
the tables.

This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
against the preregistered memory page sizes.

Since the new check uses real/virtual mode helpers, this renames
kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
case and mirrors it for the virtual mode under the old name. The real
mode handler is not used for the virtual mode as:
1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
virtual mode does not have to use and since on POWER9+radix only virtual
mode handlers actually work, we do not want to slow down that path even
a bit.

This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
are static now.

From now on the attempts on mapping IOMMU pages bigger than allowed
will result in KVM exit.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 15:45:15 +11:00
Greg Kroah-Hartman
4e1a606d55 Merge 4.19-rc7 into tty-next
We want the fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-08 15:43:12 +02:00
Finn Thain
0792a2c8e0 macintosh: Use common code to access RTC
Now that the 68k Mac port has adopted the via-pmu driver, the same RTC
code can be shared between m68k and powerpc. Replace duplicated code in
arch/powerpc and arch/m68k with common RTC accessors for Cuda and PMU.

Drop the problematic WARN_ON which was introduced in commit 22db552b50
("powerpc/powermac: Fix rtc read/write functions").

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-08 22:53:10 +11:00
Greg Kroah-Hartman
cd2093cb45 Merge tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Michael writes:
  "powerpc fixes for 4.19 #4

   Four regression fixes.

   A fix for a change to lib/xz which broke our zImage loader when
   building with XZ compression. OK'ed by Herbert who merged the
   original patch.

   The recent fix we did to avoid patching __init text broke some 32-bit
   machines, fix that.

   Our show_user_instructions() could be tricked into printing kernel
   memory, add a check to avoid that.

   And a fix for a change to our NUMA initialisation logic, which causes
   crashes in some kdump configurations.

   Thanks to:
     Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
     Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."

* tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/numa: Skip onlining a offline node in kdump path
  powerpc: Don't print kernel instructions in show_user_instructions()
  powerpc/lib: fix book3s/32 boot failure due to code patching
  lib/xz: Put CRC32_POLY_LE in xz_private.h
2018-10-07 07:05:43 +02:00
Greg Kroah-Hartman
08b297bb10 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Paolo writes:
  "KVM changes for 4.19-rc7

   x86 and PPC bugfixes, mostly introduced in 4.19-rc1."

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: nVMX: fix entry with pending interrupt if APICv is enabled
  KVM: VMX: hide flexpriority from guest when disabled at the module level
  KVM: VMX: check for existence of secondary exec controls before accessing
  KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page fault
  KVM: x86: fix L1TF's MMIO GFN calculation
  tools/kvm_stat: cut down decimal places in update interval dialog
  KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS
  KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly
  KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled
  KVM: x86: never trap MSR_KERNEL_GS_BASE
2018-10-05 08:29:44 -07:00
Srikar Dronamraju
ac1788cc7d powerpc/numa: Skip onlining a offline node in kdump path
With commit 2ea6263068 ("powerpc/topology: Get topology for shared
processors at boot"), kdump kernel on shared LPAR may crash.

The necessary conditions are
- Shared LPAR with at least 2 nodes having memory and CPUs.
- Memory requirement for kdump kernel must be met by the first N-1
  nodes where there are at least N nodes with memory and CPUs.

Example numactl of such a machine.
  $ numactl -H
  available: 5 nodes (0,2,5-7)
  node 0 cpus:
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 2 cpus:
  node 2 size: 255 MB
  node 2 free: 189 MB
  node 5 cpus: 24 25 26 27 28 29 30 31
  node 5 size: 4095 MB
  node 5 free: 4024 MB
  node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
  node 6 size: 6353 MB
  node 6 free: 5998 MB
  node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
  node 7 size: 7640 MB
  node 7 free: 7164 MB
  node distances:
  node   0   2   5   6   7
    0:  10  40  40  40  40
    2:  40  10  40  40  40
    5:  40  40  10  40  40
    6:  40  40  40  10  20
    7:  40  40  40  20  10

Steps to reproduce.
1. Load / start kdump service.
2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)

When booting a kdump kernel with 2048M:

  kexec: Starting switchover sequence.
  I'm in purgatory
  Using 1TB segments
  hash-mmu: Initializing hash mmu with SLB
  Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
  Found initrd at 0xc000000009e70000:0xc00000000ae554b4
  Using pSeries machine description
  -----------------------------------------------------
  ppc64_pft_size    = 0x1e
  phys_mem_size     = 0x88000000
  dcache_bsize      = 0x80
  icache_bsize      = 0x80
  cpu_features      = 0x000000ff8f5d91a7
    possible        = 0x0000fbffcf5fb1a7
    always          = 0x0000006f8b5c91a1
  cpu_user_features = 0xdc0065c2 0xef000000
  mmu_features      = 0x7c006001
  firmware_features = 0x00000007c45bfc57
  htab_hash_mask    = 0x7fffff
  physical_start    = 0x8000000
  -----------------------------------------------------
  numa:   NODE_DATA [mem 0x87d5e300-0x87d67fff]
  numa:     NODE_DATA(0) on node 6
  numa:   NODE_DATA [mem 0x87d54600-0x87d5e2ff]
  Top of RAM: 0x88000000, Total RAM: 0x88000000
  Memory hole size: 0MB
  Zone ranges:
    DMA      [mem 0x0000000000000000-0x0000000087ffffff]
    DMA32    empty
    Normal   empty
  Movable zone start for each node
  Early memory node ranges
    node   6: [mem 0x0000000000000000-0x0000000087ffffff]
  Could not find start_pfn for node 0
  Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
  On node 0 totalpages: 0
  Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
  On node 6 totalpages: 34816

  Unable to handle kernel paging request for data at address 0x00000060
  Faulting instruction address: 0xc000000008703a54
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
  NIP:  c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
  REGS: c00000000b673440 TRAP: 0380   Not tainted  (4.19.0-rc5-master+)
  MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24022022  XER: 20000002
  CFAR: c0000000086fc238 IRQMASK: 0
  GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
  GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
  GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
  GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
  GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
  GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
  GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
  GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
  NIP [c000000008703a54] bus_add_device+0x84/0x1e0
  LR [c000000008703a38] bus_add_device+0x68/0x1e0
  Call Trace:
  [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
  [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
  [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
  [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
  [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
  [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
  [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
  [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
  [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
  [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
  [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
  [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
  Instruction dump:
  60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
  e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
  ---[ end trace 593577668c2daa65 ]---

However a regular kernel with 4096M (2048 gets reserved for crash
kernel) boots properly.

Unlike regular kernels, which mark all available nodes as online,
kdump kernel only marks just enough nodes as online and marks the rest
as offline at boot. However kdump kernel boots with all available
CPUs. With Commit 2ea6263068 ("powerpc/topology: Get topology for
shared processors at boot"), all CPUs are onlined on their respective
nodes at boot time. try_online_node() tries to online the offline
nodes but fails as all needed subsystems are not yet initialized.

As part of fix, detect and skip early onlining of a offline node.

Fixes: 2ea6263068 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-05 23:21:54 +10:00
Michael Ellerman
a932ed3b71 powerpc: Don't print kernel instructions in show_user_instructions()
Recently we implemented show_user_instructions() which dumps the code
around the NIP when a user space process dies with an unhandled
signal. This was modelled on the x86 code, and we even went so far as
to implement the exact same bug, namely that if the user process
crashed with its NIP pointing into the kernel we will dump kernel text
to dmesg. eg:

  bad-bctr[2996]: segfault (11) at c000000000010000 nip c000000000010000 lr 12d0b0894 code 1
  bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
  bad-bctr[2996]: code: 3860001c f8010080 48242371 60000000 <7c7b1b79> 4082002c e8010080 eb610048

This was discovered on x86 by Jann Horn and fixed in commit
342db04ae7 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").

Fix it by checking the adjusted NIP value (pc) and number of
instructions against USER_DS, and bail if we fail the check, eg:

  bad-bctr[2969]: segfault (11) at c000000000010000 nip c000000000010000 lr 107930894 code 1
  bad-bctr[2969]: Bad NIP, not dumping instructions.

Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-05 23:18:12 +10:00
Paul Mackerras
aa2278644a KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VM
This adds a mode where the vcore scheduling logic in HV KVM limits itself
to scheduling only virtual cores from the same VM on any given physical
core.  This is enabled via a new module parameter on the kvm-hv module
called "one_vm_per_core".  For this to work on POWER9, it is necessary to
set indep_threads_mode=N.  (On POWER8, hardware limitations mean that KVM
is never in independent threads mode, regardless of the indep_threads_mode
setting.)

Thus the settings needed for this to work are:

1. The host is in SMT1 mode.
2. On POWER8, the host is not in 2-way or 4-way static split-core mode.
3. On POWER9, the indep_threads_mode parameter is N.
4. The one_vm_per_core parameter is Y.

With these settings, KVM can run up to 4 vcpus on a core at the same
time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest
threading mode), and will ensure that all of the vcpus belong to the
same VM.

This is intended for use in security-conscious settings where users are
concerned about possible side-channel attacks between threads which could
perhaps enable one VM to attack another VM on the same core, or the host.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-05 15:55:06 +10:00
Cameron Kaiser
1006284c5e KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR
When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.

Signed-off-by: Cameron Kaiser <spectre@floodgap.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-05 15:55:06 +10:00
Rob Herring
4355151de4 Merge branch 'all-dtbs' into dt/next 2018-10-04 14:16:15 -05:00
Nicholas Piggin
053c5a753e powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation
Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process scoped LPID flush
did not have a flush, so add it.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 23:16:53 +10:00
Nicholas Piggin
bc276ecba1 powerpc/64s/hash: Do not use PPC_INVALIDATE_ERAT on CPUs before POWER9
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.

Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.

Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 23:16:53 +10:00
Anton Blanchard
817593604e powerpc/time: Add set_state_oneshot_stopped decrementer callback
If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:

       if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
                set_dec(0x7fffffff);
        else
                set_dec(decrementer_max);

If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.

As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 23:00:31 +10:00
Anton Blanchard
8b78fdb045 powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 23:00:30 +10:00
Mark Hairgrove
f86ad3e019 powerpc/powernv/npu: Remove atsd_threshold debugfs setting
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:54 +10:00
Mark Hairgrove
3689c37d23 powerpc/powernv/npu: Use size-based ATSD invalidates
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.

The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.

ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.

With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.

To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.

One NPU (1 GPU):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.3        5.6           5%
1M         39.3       57.4          46%
2M         49.7       82.6          66%
4M        286.6      285.7           0%

Two NPUs (6 GPUs):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         6.5        7.4          13%
1M         33.4       67.9         103%
2M         38.7       93.1         141%
4M        356.7      354.6          -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:53 +10:00
Mark Hairgrove
7ead15a144 powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
        - EIEIO
        - Write the launch value to issue the ATSD

First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.

Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
    EIEIO
    for each NPU
        - Write the launch value to issue the ATSD

Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.

With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).

With two NPUs active (on a 6-GPU system) the effect is more noticeable:

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.9        6.5          10%
1M         31.2       33.4           7%
2M         36.3       38.7           7%
4M        322.6      356.7          11%

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:55:52 +10:00
Masahiro Yamada
bad96de8d3 powerpc: remove leftover code of old GCC version checks
Clean up the leftover of commit f2910f0e68 ("powerpc: remove old
GCC version checks").

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:54:51 +10:00
Daniel Axtens
f5e284803a powerpc/nohash: fix undefined behaviour when testing page size support
When enumerating page size definitions to check hardware support,
we construct a constant which is (1U << (def->shift - 10)).

However, the array of page size definitions is only initalised for
various MMU_PAGE_* constants, so it contains a number of 0-initialised
elements with def->shift == 0. This means we end up shifting by a
very large number, which gives the following UBSan splat:

================================================================================
UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
Call Trace:
[c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
[c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
[c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
[c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
[c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
[c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
================================================================================

Fix this by first checking if the element exists (shift != 0) before
constructing the constant.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-04 16:54:41 +10:00
Paul Mackerras
6579804c43 KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page fault
Commit 71d29f43b6 ("KVM: PPC: Book3S HV: Don't use compound_order to
determine host mapping size", 2018-09-11) added a call to 
__find_linux_pte() and a dereference of the returned PTE pointer to the
radix page fault path in the common case where the page is normal
system memory.  Previously, __find_linux_pte() was only called for
mappings to physical addresses which don't have a page struct (e.g.
memory-mapped I/O) or where the page struct is marked as reserved
memory.

This exposes us to the possibility that the returned PTE pointer
could be NULL, for example in the case of a concurrent THP collapse
operation.  Dereferencing the returned NULL pointer causes a host
crash.

To fix this, we check for NULL, and if it is NULL, we retry the
operation by returning to the guest, with the expectation that it
will generate the same page fault again (unless of course it has
been fixed up by another CPU in the meantime).

Fixes: 71d29f43b6 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-04 14:51:11 +10:00
Eric W. Biederman
f283801851 signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members.  The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-10-03 16:46:43 +02:00
Christophe Leroy
d90fe2acd9 powerpc: Wire up memtest
Add call to early_memtest() so that kernel compiled with
CONFIG_MEMTEST really perform memtest at startup when requested
via 'memtest' boot parameter.

Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 16:12:47 +10:00
Christophe Leroy
803d690e68 powerpc/mm: Don't report hugepage tables as memory leaks when using kmemleak
When a process allocates a hugepage, the following leak is
reported by kmemleak. This is a false positive which is
due to the pointer to the table being stored in the PGD
as physical memory address and not virtual memory pointer.

unreferenced object 0xc30f8200 (size 512):
  comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<e32b68da>] huge_pte_alloc+0xdc/0x1f8
    [<9e0df1e1>] hugetlb_fault+0x560/0x8f8
    [<7938ec6c>] follow_hugetlb_page+0x14c/0x44c
    [<afbdb405>] __get_user_pages+0x1c4/0x3dc
    [<b8fd7cd9>] __mm_populate+0xac/0x140
    [<3215421e>] vm_mmap_pgoff+0xb4/0xb8
    [<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
    [<4fcd760f>] ret_from_syscall+0x0/0x38

See commit a984506c54 ("powerpc/mm: Don't report PUDs as
memory leaks when using kmemleak") for detailed explanation.

To fix that, this patch tells kmemleak to ignore the allocated
hugepage table.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:07 +10:00
Michael Neuling
306b1c0617 powerpc/tm: Reformat comments
The comments in this file don't conform to the coding style so take
them to "Comment Formatting Re-Education Camp".

Suggested-by: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Reflow some comments and add full stops, fix spelling of Sergeant.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:07 +10:00
Petr Vorel
5bd9b4445d powerpc/config: Enable CONFIG_PRINTK_TIME
for 64bit configs which use for CONFIG_LOG_BUF_SHIFT the same
or higher value than the default (currently 17).

Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:06 +10:00
YueHaibing
01b9870ea6 powerpc: Remove duplicated include from pci_32.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:06 +10:00
Michal Suchanek
8a03e81cb1 powerpc/64s: consolidate MCE counter increment.
The code in machine_check_exception excludes 64s hvmode when
incrementing the MCE counter only to call opal_machine_check to
increment it specifically for this case.

Remove the exclusion and special case.

Fixes: a43c159042 ("powerpc/pseries: Flush SLB contents on SLB MCE
		errors.")

Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:06 +10:00
Breno Leitao
51303113e3 powerpc/tm: Print 64-bits MSR
On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.

This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:05 +10:00
Breno Leitao
5c784c8414 powerpc/tm: Remove msr_tm_active()
Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.

This function is not necessary, since MSR_TM_ACTIVE() just do the same and
could be used, removing the dualism and simplifying the code.

This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE().

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:05 +10:00
Breno Leitao
62dea077f5 powerpc/powernv: Mark function as __noreturn
There is a mismatch between function pnv_platform_error_reboot() definition
and declaration regarding function modifiers. In the declaration part, it
contains the function attribute __noreturn, while function definition
itself lacks it.

This was reported by sparse tool as an error:

  arch/powerpc/platforms/powernv/opal.c:538:6: error: symbol 'pnv_platform_error_reboot' redeclared with different type (originally declared at arch/powerpc/platforms/powernv/powernv.h:11) - different modifiers

I checked and the function is already being considered as being 'noreturn'
by the compiler, thus, I understand this patch does not change any code
being generated.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:05 +10:00
Breno Leitao
5521eb4bca powerpc/ptrace: Add support for PTRACE_SYSEMU
This is a patch that adds support for PTRACE_SYSEMU ptrace request in
PowerPC architecture.

When ptrace(PTRACE_SYSEMU, ...) request is called, it will be handled by
the arch independent function ptrace_resume(), which will tag the task with
the TIF_SYSCALL_EMU flag. This flag needs to be handled from a platform
dependent point of view, which is what this patch does.

This patch adds this task's flag as part of the _TIF_SYSCALL_DOTRACE, which
is the MACRO that is used to trace syscalls at entrance/exit.

Since TIF_SYSCALL_EMU is now part of _TIF_SYSCALL_DOTRACE, if the task has
_TIF_SYSCALL_DOTRACE set, it will hit do_syscall_trace_enter() at syscall
entrance and do_syscall_trace_leave() at syscall leave.
do_syscall_trace_enter() needs to handle the TIF_SYSCALL_EMU flag properly,
which will interrupt the syscall executing if TIF_SYSCALL_EMU is set. The
output values should not be changed, i.e. the return value (r3) should
contain the original syscall argument on exit.

With this flag set, the syscall is not executed fundamentally, because
do_syscall_trace_enter() is returning -1 which is bigger than NR_syscall,
thus, skipping the syscall execution and exiting userspace.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:04 +10:00
Breno Leitao
16d7c69c89 powerpc: Redefine TIF_32BITS thread flag
Moving TIF_32BIT to use bit 20 instead of 4 in the task flag field.

This change is making room for an upcoming new task macro
(_TIF_SYSCALL_EMU) which is preferred to set a bit in the lower 16-bits
part of the word.

This upcoming flag macro will take part in a composed macro
(_TIF_SYSCALL_DOTRACE) which will contain other flags as well, and it is
preferred that the whole _TIF_SYSCALL_DOTRACE macro only sets the lower 16
bits of a word, so, it could be handled using immediate operations (as load
immediate, add immediate, ...) where the immediate operand (SI) is limited
to 16-bits.

Another possible solution would be using the LOAD_REG_IMMEDIATE() macro
to load a full 64-bits word immediate, but it takes 5 operations instead of
one.

Having TIF_32BITS being redefined to use an upper bit is not a problem
since there is only one place in the assembly code where TIF_32BIT is being
used, and it could be replaced with an operation with right shift (addis),
since it is used alone, i.e. not being part of a composed macro, which has
different bits set, and would require LOAD_REG_IMMEDIATE().

Tested on a 64 bits Big Endian machine running a 32 bits task.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:04 +10:00
Christophe Leroy
06ec27aea9 powerpc/64: add stack protector support
On PPC64, as register r13 points to the paca_struct at all time,
this patch adds a copy of the canary there, which is copied at
task_switch.
That new canary is then used by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r13
-mstack-protector-guard-offset=offsetof(struct paca_struct, canary))

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:03 +10:00
Christophe Leroy
c3ff2a5193 powerpc/32: add stack protector support
This functionality was tentatively added in the past
(commit 6533b7c16e ("powerpc: Initial stack protector
(-fstack-protector) support")) but had to be reverted
(commit f2574030b0 ("powerpc: Revert the initial stack
protector support") because of GCC implementing it differently
whether it had been built with libc support or not.

Now, GCC offers the possibility to manually set the
stack-protector mode (global or tls) regardless of libc support.

This time, the patch selects HAVE_STACKPROTECTOR only if
-mstack-protector-guard=tls is supported by GCC.

On PPC32, as register r2 points to current task_struct at
all time, the stack_canary located inside task_struct can be
used directly by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r2
-mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary))

The protector is disabled for prom_init and bootx_init as
it is too early to handle it properly.

 $ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
[  134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64
[  134.943666]
[  134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835
[  134.963380] Call Trace:
[  134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable)
[  134.971775] [c6615dc0] [c001f654] panic+0x0/0x260
[  134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64
[  134.982769] [c6615e00] [ffffffff] 0xffffffff

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:03 +10:00
zhong jiang
cd5ff94577 powerpc/xive: Move a dereference below a NULL test
Move the dereference of xc below the NULL test.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:02 +10:00
Naveen N. Rao
9258227e9d powerpc/pseries: Fix how we iterate over the DTL entries
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.

Fixes: 7c105b63bd ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03 15:40:02 +10:00