This allows guests to have a different timebase origin from the host.
This is needed for migration, where a guest can migrate from one host
to another and the two hosts might have a different timebase origin.
However, the timebase seen by the guest must not go backwards, and
should go forwards only by a small amount corresponding to the time
taken for the migration.
Therefore this provides a new per-vcpu value accessed via the one_reg
interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value
defaults to 0 and is not modified by KVM. On entering the guest, this
value is added onto the timebase, and on exiting the guest, it is
subtracted from the timebase.
This is only supported for recent POWER hardware which has the TBU40
(timebase upper 40 bits) register. Writing to the TBU40 register only
alters the upper 40 bits of the timebase, leaving the lower 24 bits
unchanged. This provides a way to modify the timebase for guest
migration without disturbing the synchronization of the timebase
registers across CPU cores. The kernel rounds up the value given
to a multiple of 2^24.
Timebase values stored in KVM structures (struct kvm_vcpu, struct
kvmppc_vcore, etc.) are stored as host timebase values. The timebase
values in the dispatch trace log need to be guest timebase values,
however, since that is read directly by the guest. This moves the
setting of vcpu->arch.dec_expires on guest exit to a point after we
have restored the host timebase so that vcpu->arch.dec_expires is a
host timebase value.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently we are not saving and restoring the SIAR and SDAR registers in
the PMU (performance monitor unit) on guest entry and exit. The result
is that performance monitoring tools in the guest could get false
information about where a program was executing and what data it was
accessing at the time of a performance monitor interrupt. This fixes
it by saving and restoring these registers along with the other PMU
registers on guest entry/exit.
This also provides a way for userspace to access these values for a
vcpu via the one_reg interface.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Reserved fields of the sync instruction have been used for other
instructions (e.g. lwsync). On processors that do not support variants
of the sync instruction, emulate it by executing a sync to subsume the
effect of the intended instruction.
Signed-off-by: James Yang <James.Yang@freescale.com>
[scottwood@freescale.com: whitespace and subject line fix]
Signed-off-by: Scott Wood <scottwood@freescale.com>
On Book3E some SPE/FP/AltiVec interrupts share the same number. Use
common defines to indentify these numbers.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[scottwood@freescale.com: fixed space-before-tab]
Signed-off-by: Scott Wood <scottwood@freescale.com>
On Book3E some SPE/FP/AltiVec interrupts share the same number. Use
common defines to indentify these numbers.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Topic branch for commits that the KVM tree might want to pull
in separately.
Hand merged a few files due to conflicts with the LE stuff
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This provides a facility which is intended for use by KVM, where the
contents of the FP/VSX and VMX (Altivec) registers can be saved away
to somewhere other than the thread_struct when kernel code wants to
use floating point or VMX instructions. This is done by providing a
pointer in the thread_struct to indicate where the state should be
saved to. The giveup_fpu() and giveup_altivec() functions test these
pointers and save state to the indicated location if they are non-NULL.
Note that the MSR_FP/VEC bits in task->thread.regs->msr are still used
to indicate whether the CPU register state is live, even when an
alternate save location is being used.
This also provides load_fp_state() and load_vr_state() functions, which
load up FP/VSX and VMX state from memory into the CPU registers, and
corresponding store_fp_state() and store_vr_state() functions, which
store FP/VSX and VMX state into memory from the CPU registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct. In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well. Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.
This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc. This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros. This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We already had some output messages from EEH core. Occasionally,
we can see the output messages from EEH core before the stack
dump. That's not what we expected. The patch fixes that and shows
the stack dump prior to output messages from EEH core.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the CPU is generating an exception when accessing unaligned word, and
as this exception is not yet handled when running prom_init, data should be
copied from the architecture vector byte per byte.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The performance monitor interrupt is asynchronous, so we should check
if the current processor is in napping status in the handler of this
interrupt.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This was missing on powerpc and I am getting compilation error
drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While cross-building for PPC64 I've got bunch of
WARNING: arch/powerpc/kernel/built-in.o(.text.unlikely+0x2d2): Section
mismatch in reference from the function .free_lppacas() to the variable
.init.data:lppaca_size The function .free_lppacas() references the variable
__initdata lppaca_size. This is often because .free_lppacas lacks a __initdata
annotation or the annotation of lppaca_size is wrong.
Fix it by using proper annotation for free_lppacas. Additionally, annotate
{allocate,new}_llpcas properly.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Acked-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We already got the value of current_thread_info and ti_flags and store
them into r9 and r4 respectively before jumping to resume_kernel. So
there is no reason to reload them again.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__initdata tag should be placed between the variable name and equal
sign for the variable to be placed in the intended .init.data section.
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch allows the kbuild system to successfully compile a kernel
for the little endian PowerPC64 architecture. A subsequent patch
will add the CONFIG_CPU_LITTLE_ENDIAN kernel config option which
must be set to build such a kernel.
If cross compiling, CROSS_COMPILE must point to a suitable toolchain
(compiled for the powerpc64le-linux and powerpcle-linux targets).
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to fix some endian issues in our memcpy code. For now
just enable the generic memcpy routine for little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to fix some endian issues in our checksum code. For now
just enable the generic checksum routines for little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Things are complicated by the fact that VSX elements are big
endian ordered even in little endian mode. 8 byte loads and
stores also write to the top 8 bytes of the register.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Handle most unaligned load and store faults in little
endian mode. Strings, multiples and VSX are not supported.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The TS_FPR macro selects the FPR component of a VSX register (the
high doubleword). emulate_vsx is using this macro to get the
address of the associated VSX register. This happens to work on big
endian, but fails on little endian.
Replace it with an explicit array access.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The alignment handler assumes big endian ordering when selecting
the low word of a 64bit floating point value. Use the existing
union which works in both little and big endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use swab64/32/16 instead of open coding it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Create a trampoline that works in either endian and flips to
the expected endian. Use it for primary and secondary thread
entry as well as RTAS and OF call return.
Credit for finding the magic instruction goes to Paul Mackerras
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We always take signals in big endian which is wrong. Signals
should be taken in native endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
FPRs overlap the high 64bits of the first 32 VSX registers. The
ptrace FP read/write code assumes big endian ordering and grabs
the lowest 64 bits.
Fix this by using the TS_FPR macro which does the right thing.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
irq_exit() is now called on the irq stack, which can trigger a switch to
the softirq stack from the irq stack. If an interrupt happens at that
point, we will not properly detect the re-entrancy and clobber the
original return context on the irq stack.
This fixes it. The side effect is to prevent all nesting from softirq
stack to irq stack even in the "safe" case but it's simpler that way and
matches what x86_64 does.
Reported-by: Cédric Le Goater <clg@fr.ibm.com>
Tested-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we do a treclaim or trecheckpoint we end up running with userspace
PPR and DSCR values. Currently we don't do anything special to avoid
running with user values which could cause a severe performance
degradation.
This patch moves the PPR and DSCR save and restore around treclaim and
trecheckpoint so that we run with user values for a much shorter period.
More care is taken with the PPR as it's impact is greater than the DSCR.
This is similar to user exceptions, where we run HTM_MEDIUM early to
ensure that we don't run with a userspace PPR values in the kernel.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't take IRQs in tm_reclaim as we might have a bogus r13 and r1.
This turns IRQs hard off in this function.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
modalias_show() should return an empty string on error, not -ENODEV.
This causes the following false and annoying error:
> find /sys/devices -name modalias -print0 | xargs -0 cat >/dev/null
cat: /sys/devices/vio/4000/modalias: No such device
cat: /sys/devices/vio/4001/modalias: No such device
cat: /sys/devices/vio/4002/modalias: No such device
cat: /sys/devices/vio/4004/modalias: No such device
cat: /sys/devices/vio/modalias: No such device
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
Under heavy (DLPAR?) stress, we tripped this panic() in
arch/powerpc/kernel/iommu.c::iommu_init_table():
page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz));
if (!page)
panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
Before the panic() we got a page allocation failure for an order-2
allocation. There appears to be memory free, but perhaps not in the
ATOMIC context. I looked through all the call-sites of
iommu_init_table() and didn't see any obvious reason to need an ATOMIC
allocation. Most call-sites in fact have an explicit GFP_KERNEL
allocation shortly before the call to iommu_init_table(), indicating we
are not in an atomic context. There is some indirection for some paths,
but I didn't see any locks indicating that GFP_KERNEL is inappropriate.
With this change under the same conditions, we have not been able to
reproduce the panic.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
arch/powerpc/kernel/sysfs.c exports PURR with write permission.
This may be valid for kernel in phyp mode. But writing to
the file in guest mode causes crash due to a priviledge violation
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
cpu_hotplug_driver_lock() serializes CPU online/offline operations
when ARCH_CPU_PROBE_RELEASE is set. This lock interface is no longer
necessary with the following reason:
- lock_device_hotplug() now protects CPU online/offline operations,
including the probe & release interfaces enabled by
ARCH_CPU_PROBE_RELEASE. The use of cpu_hotplug_driver_lock() is
redundant.
- cpu_hotplug_driver_lock() is only valid when ARCH_CPU_PROBE_RELEASE
is defined, which is misleading and is only enabled on powerpc.
This patch removes the cpu_hotplug_driver_lock() interface. As
a result, ARCH_CPU_PROBE_RELEASE only enables / disables the cpu
probe & release interface as intended. There is no functional change
in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead. This converts the VIO bus code to use the
correct field.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead. This converts the ibmebus bus code to use the
correct field.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Starting secondary CPUs early on from Open Firmware and placing them
in a holding spin loop slows down the boot process significantly under
some hypervisors such as KVM.
This is also unnecessary when RTAS supports querying the CPU state
So let's not do it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've been keeping that field in thread_struct for a while, it contains
the "limit" of the current stack pointer and is meant to be used for
detecting stack overflows.
It has a few problems however:
- First, it was never actually *used* on 64-bit. Set and updated but
not actually exploited
- When switching stack to/from irq and softirq stacks, it's update
is racy unless we hard disable interrupts, which is costly. This
is fine on 32-bit as we don't soft-disable there but not on 64-bit.
Thus rather than fixing 2 in order to implement 1 in some hypothetical
future, let's remove the code completely from 64-bit. In order to avoid
a clutter of ifdef's, we remove the updates from C code completely
during interrupt stack switching, and instead maintain it from the
asm helper that is used to do the stack switching in the first place.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Nowadays, irq_exit() calls __do_softirq() pretty much directly
instead of calling do_softirq() which switches to the decicated
softirq stack.
This has lead to observed stack overflows on powerpc since we call
irq_enter() and irq_exit() outside of the scope that switches to
the irq stack.
This fixes it by moving the stack switching up a level, making
irq_enter() and irq_exit() run off the irq stack.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Replace the following sequence:
dma_set_mask(dev, mask);
dma_set_coherent_mask(dev, mask);
with a call to the new helper dma_set_mask_and_coherent().
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
While cross-building for PPC64 I've got
WARNING: vmlinux.o(.text.unlikely+0x1ba): Section mismatch in
reference from the function .prom_rtas_call() to the variable
.init.data:dt_string_start The function .prom_rtas_call() references
the variable __initdata dt_string_start. This is often because
.prom_rtas_call lacks a __initdata annotation or the annotation of
dt_string_start is wrong.
WARNING: vmlinux.o(.meminit.text+0xeb0): Section mismatch in reference
from the function .free_area_init_core.isra.47() to the function
.init.text:.set_pageblock_order() The function __meminit
.free_area_init_core.isra.47() references a function __init
.set_pageblock_order(). If .set_pageblock_order is only used by
.free_area_init_core.isra.47 then annotate .set_pageblock_order with a
matching annotation.
Fix it by proper annotation of prom_rtas_call.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc allmodconfig build fails with:
ERROR: ".cpu_to_chip_id" [drivers/block/mtip32xx/mtip32xx.ko] undefined!
The problem was introduced with commit 15863ff3b (powerpc: Make chip-id
information available to userspace).
Export the missing symbol.
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>