POWER9 requires the host to set up a partition table, which is a
table in memory indexed by logical partition ID (LPID) which
contains the pointers to page tables and process tables for the
host and each guest.
This factors out the initialization of the partition table into
a single function. This code was previously duplicated between
hash_utils_64.c and pgtable-radix.c.
This provides a function for setting a partition table entry,
which is used in early MMU initialization, and will be used by
KVM whenever a guest is created. This function includes a tlbie
instruction which will flush all TLB entries for the LPID and
all caches of the partition table entry for the LPID, across the
system.
This also moves a call to memblock_set_current_limit(), which was
in radix_init_partition_table(), but has nothing to do with the
partition table. By analogy with the similar code for hash, the
call gets moved to near the end of radix__early_init_mmu(). It
now gets called when running as a guest, whereas previously it
would only be called if the kernel is running as the host.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file. This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.
As the only way to read such an mm is through access_process_vm
spin a variant called ptrace_access_vm that will fail if the
target process is not being ptraced by the current process, or
the current process did not have sufficient privileges when ptracing
began to read the target processes mm.
In the ptrace implementations replace access_process_vm by
ptrace_access_vm. There remain several ptrace sites that still use
access_process_vm as they are reading the target executables
instructions (for kernel consumption) or register stacks. As such it
does not appear necessary to add a permission check to those calls.
This bug has always existed in Linux.
Fixes: v1.0
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
After patch 4efca4ed0 ("kbuild: modversions for EXPORT_SYMBOL() for asm"),
asm exports can get modversions CRCs generated if they have C definitions
in asm-prototypes.h. This patch adds missing definitions for 32 and 64 bit
allmodconfig builds.
Fixes: 9445aa1a30 ("ppc: move exports to definitions")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The T4240RDB contains a W83793 hardware monitoring chip. Add a device
tree entry to make the driver attach to it, as the i2c-mpc bus driver
dropped support for class-based instantiation of devices a long time
ago.
Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: Scott Wood <oss@buserror.net>
There is a new bit, LPCR_PECE_HVEE (Hypervisor Virtualization Exit
Enable), which controls wakeup from STOP states on Hypervisor
Virtualization Interrupts (which happen to also be all external
interrupts in host or bare metal mode).
It needs to be set or we will miss wakeups.
Fixes: 9baaef0a22 ("powerpc/irq: Add support for HV virtualization interrupts")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rename it to HVEE to match the name in the ISA]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_pe_reset and eeh_reset_pe are two different functions in the same
file which do mostly the same thing. Not only is this confusing, but
potentially causes disrepancies in functionality, notably eeh_reset_pe
as it does not check return values for failure.
Refactor this into the following:
- eeh_pe_reset(): stays as is, performs a single operation, exported
- eeh_pe_reset_full(): new, full reset process that calls eeh_pe_reset()
- eeh_reset_pe(): removed and replaced by eeh_pe_reset_full()
- eeh_reset_pe_once(): removed
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PHB, PE (and by association MVE) numbers are printed as a mix of decimal
and hexadecimal throughout the kernel. This can be misleading, so make
them all hexadecimal.
Standardising on hex instead of dec because:
- PHB numbers are presented in hex in sysfs/debugfs (and lspci, etc)
- PE numbers are presented as hex in sysfs and parsed in hex in debugfs
The only place I think this could cause confusing are the messages during
boot, i.e.
pci 000a:01 : [PE# 000] Secondary bus 1 associated with PE#0
which can be a quick way to check PE numbers. pe_level_printk() will
only print two characters instead of three, so the above would be
pci 000a:01 : [PE# 00] Secondary bus 1 associated with PE#0
which gives a hint it's in hex.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Whenever a PE is initialised in powernv, opal_pci_eeh_freeze_clear() is
called. This is to remove any existing freeze, and has no negative side
effects if the PE is already in an unfrozen state. On PHB backends that
don't support this operation and return OPAL_UNSUPPORTED, this creates a
scary and misleading warning message.
Skip the warning message on init if OPAL_UNSUPPORTED is returned.
As far as I'm aware, this currently only affects NPUs.
Fixes: 313483d ("powerpc/powernv: Unfreeze PE on allocation")
Signed-off-by: Russell Currey <ruscur@russell.cc>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Drop duplicate header asm/iommu.h from book3s_64_vio_hv.c.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The hashed page table MMU in POWER processors can update the R
(reference) and C (change) bits in a HPTE at any time until the
HPTE has been invalidated and the TLB invalidation sequence has
completed. In kvmppc_h_protect, which implements the H_PROTECT
hypercall, we read the HPTE, modify the second doubleword,
invalidate the HPTE in memory, do the TLB invalidation sequence,
and then write the modified value of the second doubleword back
to memory. In doing so we could overwrite an R/C bit update done
by hardware between when we read the HPTE and when the TLB
invalidation completed. To fix this we re-read the second
doubleword after the TLB invalidation and OR in the (possibly)
new values of R and C. We can use an OR since hardware only ever
sets R and C, never clears them.
This race was found by code inspection. In principle this bug could
cause occasional guest memory corruption under host memory pressure.
Fixes: a8606e20e4 ("KVM: PPC: Handle some PAPR hcalls in the kernel", 2011-06-29)
Cc: stable@vger.kernel.org # v3.19+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When switching from/to a guest that has a transaction in progress,
we need to save/restore the checkpointed register state. Although
XER is part of the CPU state that gets checkpointed, the code that
does this saving and restoring doesn't save/restore XER.
This fixes it by saving and restoring the XER. To allow userspace
to read/write the checkpointed XER value, we also add a new ONE_REG
specifier.
The visible effect of this bug is that the guest may see its XER
value being corrupted when it uses transactions.
Fixes: e4e3812150 ("KVM: PPC: Book3S HV: Add transactional memory support")
Fixes: 0a8eccefcb ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently we mark a HPTE for emulated MMIO with HPTE_V_ABSENT bit
set as well as key 0x1f. However, those HPTEs may be conflicted with
the HPTE for real guest RAM page HPTE with key 0x1f when the page
get paged out.
This patch clears the key field of HPTE when the page is paged out,
then recover it when HPTE is re-established.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Using list_move_tail() instead of list_del() + list_add_tail().
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A bunch of KVM functions are only called from assembler.
Give them prototypes in asm-prototypes.h
This reduces sparse warnings.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Squash a couple of sparse warnings by making things static.
Build tested.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Pull powerpc fixes from Michael Ellerman:
"Fixes marked for stable:
- fix system reset interrupt winkle wakeups
- fix setting of AIL in hypervisor mode
Fixes for code merged this cycle:
- fix exception vector build with 2.23 era binutils
- fix missing update of HID register on secondary CPUs
Other:
- fix missing pr_cont()s
- invalidate ERAT on tlbiel for POWER9 DD1"
* tag 'powerpc-4.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix missing update of HID register on secondary CPUs
powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1
powerpc/64: Fix setting of AIL in hypervisor mode
powerpc/oops: Fix missing pr_cont()s in instruction dump
powerpc/oops: Fix missing pr_cont()s in show_regs()
powerpc/oops: Fix missing pr_cont()s in print_msr_bits() et. al.
powerpc/oops: Fix missing pr_cont()s in show_stack()
powerpc: Fix exception vector build with 2.23 era binutils
powerpc/64s: Fix system reset interrupt winkle wakeups
We need to update on secondaries for the selected MMU mode.
Fixes: ad410674f5 ("powerpc/mm: Update the HID bit when switching from radix to hash")
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The ibm_pa_features array consists of structures that describe which bit
and byte in the ibm,pa-features property toggles one or more flags in
either the CPU, MMU, or user visible feature flags.
Each one consists of 7 values, which are all unsigned long, int or char,
meaning the compiler gives us no warning if we assign the wrong values
to the wrong elements. In fact we have had a bug here in the past, where
we were setting incorrect bits, see commit 6997e57d69 ("powerpc:
scan_features() updates incorrect bits for REAL_LE").
So switch to using named initialisers for the structure elements, to
reduce the likelihood of future bugs, and hopefully improve readability
also.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
These are the PPC optimised versions of various crypto algorithms, so we
should turn them on by default to get test coverage.
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The IBMEBUS code supports the GX bus found on Power7 and earlier CPUs.
On Power8 it has been replaced, and so we have no need for it.
We don't actually have a config symbol for Power8 vs Power7 etc., but
we only support booting little endian on Power8 or later, so use that as
a reasonable approximation.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When ending an oops, don't clear die_owner unless the nest count
went to zero. This prevents a second nested oops from hanging forever
on the die_lock.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When exiting xmon with 'x' (exit and recover), oops_begin bails
out immediately, but die then calls __die() and oops_end(), which
cause a lot of bad things to happen.
If the debugger was attached then went to graceful recovery, exit
from die() immediately.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add an option to use thin archives to build the kernel.
Thin archives are explained in commit a5967db9af ("kbuild: allow
architectures to use thin archives instead of ld -r").
This is a gradual way to introduce the option to testers.
Some change to the way we invoke ar is required so it can be used
by scripts/link-vmlinux.sh.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it an explicit option not dependant on COMPILE_TEST]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Under some configs we need to explicitly include cpu_has_feature.h,
otherwise we fail with:
arch/powerpc/lib/sstep.c:1992:7: error: implicit declaration of function 'cpu_has_feature'
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pasrsing of data written to the dlpar file in sysfs does not correctly
account for the possibility of reading past the end of the buffer. The code
assumes that all pieces of the command witten to the sysfs file are present
in the form "<resource> <action> <id_type> <id>".
Correct this by updating the buffer parsing code to make a local copy and
use the strsep() and sysfs_streq() routines to parse the buffer. This patch
also separates the parsing code into subroutines for each piece of the
command.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove the unused but set variable srr1 in save_mce_event() to
fix the following GCC warning when building with 'W=1':
arch/powerpc/kernel/mce.c:75:11: warning: variable 'srr1' set but not used
It has never been used.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix two [-Wold-style-declaration] GCC warnings by moving the inline
keyword before the return type.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
MIN_HUGEPTE_SHIFT hasn't been used since commit d1837cba5d
("powerpc/mm: Cleanup initialization of hugepages on powerpc")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9 DD1, when we do a local TLB invalidate we also need to explicitly
invalidate the ERAT.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Version 3.00 of the ISA states that the PATS (partition table size) field
of the PTCR (partition table control register) and the PRTS (process table
size) field of the partition table entry must both be less than or equal
to 24. However the actual size of the partition and process tables is equal
to 2 to the power of 12 plus the PATS and PRTS fields, respectively. This
means that the max allowable size of each of these tables is 2^36 or 64GB
for both.
Thus when checking the size shift for each we should be checking for values
of greater than 36 instead of the current check for shifts larger than 24
and 23.
Fixes: 2bfd65e45e
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Useful to be able to dump the kernel hash page table to check
which pages are hashed along with their sizes and other details.
Add a debugfs file to check the hash page table. If radix is enabled
(and so there is no hash page table) then this file doesn't exist. To
use this the PPC_PTDUMP config option must be selected.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
[mpe: Fix build with SPARSEMEM_VMEMMAP=n & PSERIES=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Useful to be able to dump the kernels page tables to check permissions
and memory types - derived from arm64's implementation.
Add a debugfs file to check the page tables. To use this the PPC_PTDUMP
config option must be selected.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently there's some CMO (Cooperative Memory Overcommit) code, in
plpar_wrappers.h. Some of it is #ifdef CONFIG_PSERIES and some of it
isn't. The end result being if a file includes plpar_wrappers.h it won't
build with CONFIG_PSERIES=n.
Fix it by moving the CMO code into platforms/pseries. The two hcall
wrappers can just be moved into their only caller, cmm.c, and the
accessors can go in pseries.h.
Note we need the accessors because cmm.c can be built as a module, so
there needs to be a split between the built-in code vs the module, and
that's achieved by using those accessors.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This changes the way that we support the new ISA v3.00 HPTE format.
Instead of adapting everything that uses HPTE values to handle either
the old format or the new format, depending on which CPU we are on,
we now convert explicitly between old and new formats if necessary
in the low-level routines that actually access HPTEs in memory.
This limits the amount of code that needs to know about the new
format and makes the conversions explicit. This is OK because the
old format contains all the information that is in the new format.
This also fixes operation under a hypervisor, because the H_ENTER
hypercall (and other hypercalls that deal with HPTEs) will continue
to require the HPTE value to be supplied in the old format. At
present the kernel will not boot in HPT mode on POWER9 under a
hypervisor.
This fixes and partially reverts commit 50de596de8
("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
Fixes: 50de596de8 ("powerpc/mm/hash: Add support for Power9 Hash")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>