Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a kernel doesn't support MSA context (ie. CONFIG_CPU_HAS_MSA=n) then
it will only keep 64 bits per FP register in thread context, and the
calls to set_fpr64 in restore_msa_extcontext will overrun the end of the
FP register context into the FCSR & MSACSR values. GCC 6.x has become
smart enough to detect this & complain like so:
arch/mips/kernel/signal.c: In function 'protected_restore_fp_context':
./arch/mips/include/asm/processor.h:114:17: error: array subscript is above array bounds [-Werror=array-bounds]
fpr->val##width[FPR_IDX(width, idx)] = val; \
~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
./arch/mips/include/asm/processor.h:118:1: note: in expansion of macro 'BUILD_FPR_ACCESS'
BUILD_FPR_ACCESS(64)
The only way to trigger this code to run would be for a program to set
up an artificial extended MSA context structure following a sigframe &
execute sigreturn. Whilst this doesn't allow a program to write to any
state that it couldn't already, it makes little sense to allow this
"restoration" of MSA context in a system that doesn't support MSA.
Fix this by killing a program with SIGSYS if it tries something as crazy
as "restoring" fake MSA context in this way, also fixing the build error
& allowing for most of restore_msa_extcontext to be optimised out of
kernels without support for MSA.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Reported-by: Michal Toman <michal.toman@imgtec.com>
Fixes: bf82cb30c7 ("MIPS: Save MSA extended context around signals")
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Toman <michal.toman@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: stable <stable@vger.kernel.org> # v4.3+
Patchwork: https://patchwork.linux-mips.org/patch/13164/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Calculate the MIPS clockevent device's min_delta_ns dynamically based on
the time it takes to perform the mips_next_event() sequence.
Virtualisation in particular makes the current fixed min_delta of 0x300
inappropriate under some circumstances, as the CP0_Count and CP0_Compare
registers may be being emulated by the hypervisor, and the frequency may
not correspond directly to the CPU frequency.
We actually use twice the median of multiple 75th percentiles of
multiple measurements of how long the mips_next_event() sequence takes,
in order to fairly efficiently eliminate outliers due to unexpected
hypervisor latency (which would need handling with retries when it
occurs during normal operation anyway).
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13176/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
When estimating the clock frequency based on the RTC, take seconds into
account in case the Update In Progress (UIP) bit wasn't seen. This can
happen in virtual machines (which may get pre-empted by the hypervisor
at inopportune times) with QEMU emulating the RTC (and in fact not
setting the UIP bit for very long), especially on slow hosts such as
FPGA systems and hardware emulators. This results in several seconds
actually having elapsed before seeing the UIP bit instead of just one
second, and exaggerated timer frequencies.
While updating the comments, they're also fixed to match the code in
that the rising edge of the update flag is detected first, not the
falling edge.
The rising edge gives a more precise point to read the counters in a
virtualised system than the falling edge, resulting in a more accurate
frequency.
It does however mean that we have to also wait for the falling edge
before doing the read of the RTC seconds register, otherwise it seems to
be possible in slow hardware emulation to stray into the interval when
the RTC time is undefined during the update (at least 244uS after the
rising edge of the update flag). This can result in both seconds values
reading the same, and it wrapping to 60 seconds, vastly underestimating
the frequency.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13174/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The sampling of the GIC counter on Malta after observing a rising edge
of the RTC update flag differs slightly between the first and second
sample, with the first sample also calling gic_start_count(). The two
samples should really be taken as similarly as possible to get the most
accurate figure, so move the gic_start_count() call before detecting the
rising edge.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13173/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Commit 9791554b45 ("MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options
for MIPS") added support for the PR_SET_FP_MODE prctl, which allows a
userland program to modify its FP mode at runtime. This is most notably
required if dynamic linking leads to the FP mode requirement changing at
runtime from that indicated in the initial executable's ELF header. In
order to avoid overhead in the general FP context restore code, it aimed
to have threads in the process become unable to enable the FPU during a
mode switch & have the thread calling the prctl syscall wait for all
other threads in the process to be context switched at least once. Once
that happens we can know that no thread in the process whose mode will
be switched has live FP context, and it's safe to perform the mode
switch. However in the (rare) case of modeswitches occurring in
multithreaded programs this can lead to indeterminate delays for the
thread invoking the prctl syscall, and the code monitoring for those
context switches was woefully inadequate for all but the simplest cases.
Fix this by broadcasting an IPI if other CPUs may have live FP context
for an affected thread, with a handler causing those CPUs to relinquish
their FPU ownership. Threads will then be allowed to continue running
but will stall on the wait_on_atomic_t in enable_restore_fp_context if
they attempt to use FP again whilst the mode switch is still in
progress. The end result is less fragile poking at scheduler context
switch counts & a more expedient completion of the mode switch.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: 9791554b45 ("MIPS,prctl: add PR_[GS]ET_FP_MODE prctl options for MIPS")
Reviewed-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: stable <stable@vger.kernel.org> # v4.0+
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13145/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
If an address error exception occurs for a LDXC1 or SDXC1 instruction,
within the cop1x opcode space, allow it to be passed through to the FPU
emulator rather than resulting in a SIGILL. This causes LDXC1 & SDXC1 to
be handled in a manner consistent with the more common LDC1 & SDC1
instructions.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13143/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Correct the cases missed with commit 9b26616c8d ("MIPS: Respect the
ISA level in FCSR handling") and prevent writes to read-only FCSR bits
there.
This in particular applies to FP context initialisation where any IEEE
754-2008 bits preset by `mips_set_personality_nan' are cleared before
the relevant ptrace(2) call takes effect and the PTRACE_POKEUSR request
addressing FPC_CSR where no masking of read-only FCSR bits is done.
Remove the FCSR clearing from FP context initialisation then and unify
PTRACE_POKEUSR/FPC_CSR and PTRACE_SETFPREGS handling, by factoring out
code from `ptrace_setfpregs' and calling it from both places.
This mostly matters to soft float configurations where the emulator can
be switched this way to a mode which should not be accessible and cannot
be set with the CTC1 instruction. With hard float configurations any
effect is transient anyway as read-only bits will retain their values at
the time the FP context is restored.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: stable@vger.kernel.org # v4.0+
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13239/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Fix a floating-point context restoration regression introduced with
commit 9b26616c8d ("MIPS: Respect the ISA level in FCSR handling")
that causes a Floating Point exception and consequently a kernel oops
with hard float configurations when one or more FCSR Enable and their
corresponding Cause bits are set both at a time via a ptrace(2) call.
To do so reinstate Cause bit masking originally introduced with commit
b1442d39fa ("MIPS: Prevent user from setting FCSR cause bits") to
address this exact problem and then inadvertently removed from the
PTRACE_SETFPREGS request with the commit referred above.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com>
Cc: stable@vger.kernel.org # v4.0+
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13238/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The GuestCtl1 CP0 register can contain the GuestID used for root TLB
operations, which affects TLB matching. The other TLB registers are
already dumped out to the log on a machine check exception due to
multiple matching TLB entries, so also dump the value of the GuestCtl1
register if GuestIDs are supported.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13232/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The GuestID for root TLB operations (GuestCtl1.RID) is modified by TLB
reads, so needs preserving by dump_tlb() like the ASID field of EntryHi.
Also dump the GuestID of each entry if it exists alongside the ASID, as
it forms an important part of the TLB entry when VZ guests are used.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13230/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Add a few new cpu-features.h definitions for VZ sub-features, namely the
existence of the CP0_GuestCtl0Ext, CP0_GuestCtl1, and CP0_GuestCtl2
registers, and support for GuestID to dialias TLB entries belonging to
different guests.
Also add certain features present in the guest, with the naming scheme
cpu_guest_has_*. These are added separately to the main options bitfield
since they generally parallel similar features in the root context. A
few of these (FPU, MSA, watchpoints, perf counters, CP0_[X]ContextConfig
registers, MAAR registers, and probably others in future) can be
dynamically configured in the guest context, for which the
cpu_guest_has_dyn_* macros are added.
[ralf@linux-mips.org: Resolve merge conflict.]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13231/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Add CPU feature for standard MIPS r2 performance counters, as determined
by the Config1.PC bit. Both perf_events and oprofile probe this bit, so
lets combine the probing and change both to use cpu_has_perf.
This will also be used for VZ support in KVM to know whether performance
counters exist which can be exposed to guests.
[ralf@linux-mips.org: resolve conflict.]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Robert Richter <rric@kernel.org>
Cc: linux-mips@linux-mips.org
Cc: oprofile-list@lists.sf.net
Patchwork: https://patchwork.linux-mips.org/patch/13226/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The CP0_[X]ContextConfig registers are present if CP0_Config3.CTXTC or
CP0_Config3.SM are set, and provide more control over which bits of
CP0_[X]Context are set to the faulting virtual address on a TLB
exception.
KVM/VZ will need to be able to save and restore these registers in the
guest context, so add the relevant definitions and probing of the
ContextConfig feature in the root context first.
[ralf@linux-mips.org: resolve merge conflict.]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13225/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The optional CP0_BadInstr and CP0_BadInstrP registers are written with
the encoding of the instruction that caused a synchronous exception to
occur, and the prior branch instruction if in a delay slot.
These will be useful for instruction emulation in KVM, and especially
for VZ support where reading guest virtual memory is a bit more awkward.
Add CPU option numbers and cpu_has_* definitions to indicate the
presence of each registers, and add code to probe for them using bits in
the CP0_Config3 register.
[ralf@linux-mips.org: resolve merge conflict.]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13224/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The CP0_EBase register may optionally have a write gate (WG) bit to
allow the upper bits to be written, i.e. bits 31:30 on MIPS32 since r3
(to allow for an exception base outside of KSeg0/KSeg1 when segmentation
control is in use) and bits 63:30 on MIPS64 (which also implies the
extension of CP0_EBase to 64 bits long).
The presence of this feature will need to be known about for VZ support
in order to correctly save and restore all the bits of the guest
CP0_EBase register, so add CPU feature definition and probing for this
feature.
Probing the WG bit on MIPS64 can be a bit fiddly, since 64-bit COP0
register access instructions were UNDEFINED for 32-bit registers prior
to MIPS r6, and it'd be nice to be able to probe without clobbering the
existing state, so there are 3 potential paths:
- If we do a 32-bit read of CP0_EBase and the WG bit is already set, the
register must be 64-bit.
- On MIPS r6 we can do a 64-bit read-modify-write to set CP0_EBase.WG,
since the upper bits will read 0 and be ignored on write if the
register is 32-bit.
- On pre-r6 cores, we do a 32-bit read-modify-write of CP0_EBase. This
avoids the potentially UNDEFINED behaviour, but will clobber the upper
32-bits of CP0_EBase if it isn't a simple sign extension (which also
requires us to ensure BEV=1 or modifying the exception base would be
UNDEFINED too). It is hopefully unlikely a bootloader would set up
CP0_EBase to a 64-bit segment and leave WG=0.
[ralf@linux-mips.org: Resolved merge conflict.]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Tested-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13223/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Octeon machines support running in little endian mode. U-Boot usually
runs in big endian-mode. Therefore the initramfs is loaded in big endian
mode, and the kernel later tries to access it in little endian mode.
This patch fixes that by detecting byte swapped initramfs using either the
CPIO header or the header from standard compression methods, and
byte swaps it if needed. It first checks that the header doesn't match
in the native endianness to avoid false detections. It uses the kernel
decompress library so that we don't have to maintain the list of magics
if some decompression methods are added to the kernel.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Acked-by: David Daney <david.daney@cavium.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13219/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
For XPA kernels build_update_entries() uses $1 (at) as a scratch
register, but doesn't arrange for it to be preserved, so it will always
be clobbered by the TLB refill exception. Although this register
normally has a very short lifetime that doesn't cross memory accesses,
TLB refills due to instruction fetches (either on a page boundary or
after preemption) could clobber live data, and its easy to reproduce
the clobber with a little bit of assembler code.
Note that the use of a hardware page table walker will partly mask the
problem, as the TLB refill handler will not always be invoked.
This is fixed by avoiding the use of the extra scratch register. The
pte_high parts (going into the lower half of the EntryLo registers) are
loaded and manipulated separately so as to keep the PTE pointer around
for the other halves (instead of storing in the scratch register), and
the pte_low parts (going into the high half of the EntryLo registers)
are masked with 0x00ffffff using an ext instruction (instead of loading
0x00ffffff into the scratch register and AND'ing).
[paul.burton@imgtec.com:
- Rebase atop other TLB work.
- Use ext instead of an sll, srl sequence.
- Use cpu_has_xpa instead of #ifdefs.
- Modify commit subject to include "mm".]
Fixes: c5b367835c ("MIPS: Add support for XPA.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/13120/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
asm/pgtable-bits.h has grown to become an unreadable mess of #ifdef
directives defining bits conditionally upon other bits all at the
preprocessing stage, for no good reason.
Instead of having quite so many #ifdef's, simply use enums to provide
sequential numbering for bit shifts, without having to keep track
manually of what the last bit defined was. Masks are defined separately,
after the shifts, which allows for most of their definitions to be
reused for all systems rather than duplicated.
This patch is not intended to make any behavioural change to the code -
all bits should be used in the same way they were before this patch.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Alex Smith <alex.smith@imgtec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13115/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
XPA (eXtended Physical Addressing) should be detected as a combination
of two architectural features:
- Large Physical Address (as per Config3.LPA). With XPA this will be set
on MIPS32r5 cores, but it may also be set for MIPS64r2 cores too.
- MTHC0/MFHC0 instructions (as per Config5.MVH). With XPA this will be
set, but it may also be set in VZ guest context even when Config3.LPA
in the guest context has been cleared by the hypervisor.
As such, XPA is only usable if both bits are set. Update CPU features to
separate these two features, with cpu_has_xpa requiring both to be set.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Maciej W. Rozycki <macro@imgtec.com>
Cc: Joshua Kinard <kumba@gentoo.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13112/
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
When emulating a jalr instruction with rd == $0, the code in
isBranchInstr was incorrectly writing to GPR $0 which should actually
always remain zeroed. This would lead to any further instructions
emulated which use $0 operating on a bogus value until the task is next
context switched, at which point the value of $0 in the task context
would be restored to the correct zero by a store in SAVE_SOME. Fix this
by not writing to rd if it is $0.
Fixes: 102cedc32a ("MIPS: microMIPS: Floating point support.")
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Maciej W. Rozycki <macro@imgtec.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Cc: stable <stable@vger.kernel.org> # v3.10
Patchwork: https://patchwork.linux-mips.org/patch/13160/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The code in _sp_maddf (formerly ieee754sp_madd) appears to have been
copied verbatim from ieee754sp_add, and although it's adding the
unpacked "r" & "z" floats it kept using macros that operate on "x" &
"y". This led to the addition being carried out incorrectly on some
mismash of the product, accumulator & multiplicand fields. Typically
this would lead to the assertions "ze == re" & "ze <= SP_EMAX" failing
since ze & re hadn't been operated upon.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: e24c3bec3e ("MIPS: math-emu: Add support for the MIPS R6 MADDF FPU instruction")
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Maciej W. Rozycki <macro@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/13159/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>