
Pull x86 entry updates from Thomas Gleixner:
"The x86 entry, exception and interrupt code rework

This all started about 6 months ago with the attempt to move the Posix
CPU timer heavy lifting out of the timer interrupt code and just have
lockless quick checks in that code path. Trivial 5 patches.

This unearthed an inconsistency in the KVM handling of task work and
the review requested to move all of this into generic code so other
architectures can share.

Valid request and solved with another 25 patches but those unearthed
inconsistencies vs. RCU and instrumentation.

Digging into this made it obvious that there are quite some
inconsistencies vs. instrumentation in general. The int3 text poke
handling in particular was completely unprotected and with the batched
update of trace events even more likely to be exposed to endless int3
recursion.

In parallel the RCU implications of instrumenting fragile entry code
came up in several discussions.

The conclusion of the x86 maintainer team was to go all the way and
make the protection against any form of instrumentation of fragile and
dangerous code paths enforceable and verifiable by tooling.

A first batch of preparatory work hit mainline with commit d5f744f9a2
("Pull x86 entry code updates from Thomas Gleixner").

That (almost) full solution introduced a new code section
'.noinstr.text' into which all code which needs to be protected from
instrumentation of all sorts goes. Any call into instrumentable
code out of this section has to be annotated. objtool has support to
validate this.

Kprobes now excludes this section fully which also prevents BPF from
fiddling with it and all 'noinstr' annotated functions also keep
ftrace off. The section, kprobes and objtool changes are already
merged.
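
As a rough illustration of the section mechanism described above, the
following userspace C sketch places one function into a dedicated
'.noinstr.text' section and marks it as not instrumentable. The macro
name my_noinstr and both functions are hypothetical stand-ins; the
kernel's real 'noinstr' annotation may carry additional attributes and
is not reproduced here. The sketch assumes GCC or Clang targeting ELF.

```c
/*
 * Hypothetical userspace stand-in for the kernel's 'noinstr' annotation.
 * Assumption: GCC or Clang on an ELF target.
 *  - section(".noinstr.text") moves the function out of plain .text
 *  - no_instrument_function keeps -finstrument-functions hooks away
 *  - noinline prevents the body from being merged into instrumentable callers
 */
#define my_noinstr \
	__attribute__((noinline, no_instrument_function, \
		       section(".noinstr.text")))

/* Fragile code: placed in .noinstr.text and never instrumented. */
static my_noinstr int fragile_entry(int vector)
{
	return vector + 1;
}

/* Ordinary code: stays in .text and may be instrumented by the compiler. */
static int normal_handler(int vector)
{
	return vector * 2;
}
```

Calling fragile_entry(3) and normal_handler(3) from a small main()
behaves like any other function call; the difference should only be
visible in the object file, where a tool such as 'objdump -h' is
expected to list the extra section.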

The major changes coming with this are:

- Preparatory cleanups

- Annotating of relevant functions to move them into the
  noinstr.text section or enforcing inlining by marking them
  __always_inline so the compiler cannot misplace or instrument
  them.

- Splitting and simplifying the idtentry macro maze so that it is
  now clearly separated into simple exception entries and the more
  interesting ones which use interrupt stacks and have the paranoid
  handling vs. CR3 and GS.

- Move quite some of the low level ASM functionality into C code:

    - enter_from and exit to user space handling. The ASM code now
      calls into C after doing the really necessary ASM handling and
      the return path goes back out without bells and whistles in
      ASM.

    - exception entry/exit got the equivalent treatment

    - move all IRQ tracepoints from ASM to C so they can be placed as
      appropriate which is especially important for the int3
      recursion issue.

- Consolidate the declaration and definition of entry points between
  32 and 64 bit. They share a common header and macros now.

- Remove the extra device interrupt entry maze and just use the
  regular exception entry code.

- All ASM entry points except NMI are now generated from the shared
  header file and the corresponding macros in the 32 and 64 bit
  entry ASM.

- The C code entry points are consolidated as well with the help of
  DEFINE_IDTENTRY*() macros. This allows to ensure at one central
  point that all corresponding entry points share the same
  semantics. The actual function body for most entry points is in an
  instrumentable and sane state.

  There are special macros for the more sensitive entry points, e.g.
  INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
  They allow to put the whole entry instrumentation and RCU handling
  into safe places instead of the previous pray that it is correct
  approach.

- The INT3 text poke handling is now completely isolated and the
  recursion issue banned. Aside from the entry rework this required
  other isolation work, e.g. the ability to force inline bsearch.

- Prevent #DB on fragile entry code, entry relevant memory and
  disable it on NMI, #MC entry, which allowed to get rid of the
  nested #DB IST stack shifting hackery.

- A few other cleanups and enhancements which have been made
  possible through this and already merged changes, e.g.
  consolidating and further restricting the IDT code so the IDT
  table becomes RO after init which removes yet another popular
  attack vector

- About 680 lines of ASM maze are gone.
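
The idea behind the consolidated C entry points in the list above can be
sketched in plain C: one macro emits the outer entry function, which
always performs the enter/exit bookkeeping, and leaves only the inner
body to be written per entry point. This is a simplified userspace
model, not the kernel's DEFINE_IDTENTRY*() implementation; the names
DEMO_DEFINE_ENTRY, demo_enter(), demo_exit() and the counters are
hypothetical.

```c
/* Hypothetical bookkeeping standing in for entry/exit state tracking. */
static int entry_depth;
static int entries_seen;

static void demo_enter(void)
{
	entry_depth++;
	entries_seen++;
}

static void demo_exit(void)
{
	entry_depth--;
}

/*
 * DEMO_DEFINE_ENTRY(name) emits the outer function 'name', which wraps
 * the call to the inner body in demo_enter()/demo_exit(), and then opens
 * the definition of that inner body. Every entry point defined through
 * the macro therefore shares the same enter/exit semantics.
 */
#define DEMO_DEFINE_ENTRY(name)					\
	static int name##_body(int arg);			\
	static int name(int arg)				\
	{							\
		int ret;					\
								\
		demo_enter();					\
		ret = name##_body(arg);				\
		demo_exit();					\
		return ret;					\
	}							\
	static int name##_body(int arg)

DEMO_DEFINE_ENTRY(demo_divide_error)
{
	/* The body only runs with the bookkeeping already in place. */
	return entry_depth == 1 ? arg : -1;
}
```

Because the wrapper is generated in one place, a change to the
bookkeeping applies to every entry point defined through the macro.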

There are a few open issues:

- An escape out of the noinstr section in the MCE handler which needs
  some more thought but under the aspect that MCE is a complete
  trainwreck by design and the probability to survive it is low, this
  was not high on the priority list.

- Paravirtualization

  When PV is enabled then objtool complains about a bunch of indirect
  calls out of the noinstr section. There are a few straightforward
  ways to fix this, but the other issues vs. general correctness were
  more pressing than parawitz.

- KVM

  KVM is inconsistent as well. Patches have been posted, but they
  have not yet been commented on or picked up by the KVM folks.

- IDLE

  Pretty much the same problems can be found in the low level idle
  code especially the parts where RCU stopped watching. This was
  beyond the scope of the more obvious and exposable problems and is
  on the todo list.

The lesson learned from this brain melting exercise to morph the
evolved code base into something which can be validated and understood
is that once again the violation of the most important engineering
principle "correctness first" has caused quite a few people to spend
valuable time on problems which could have been avoided in the first
place. The "features first" tinkering mindset really has to stop.

With that I want to say thanks to everyone involved in contributing to
this effort. Special thanks go to the following people (alphabetical
order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
Jiangshan, Marco Elver, Paolo Bonzini, Paul McKenney, Peter Zijlstra,
Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
x86/entry: Force rcu_irq_enter() when in idle task
x86/entry: Make NMI use IDTENTRY_RAW
x86/entry: Treat BUG/WARN as NMI-like entries
x86/entry: Unbreak __irqentry_text_start/end magic
x86/entry: __always_inline CR2 for noinstr
lockdep: __always_inline more for noinstr
x86/entry: Re-order #DB handler to avoid *SAN instrumentation
x86/entry: __always_inline arch_atomic_* for noinstr
x86/entry: __always_inline irqflags for noinstr
x86/entry: __always_inline debugreg for noinstr
x86/idt: Consolidate idt functionality
x86/idt: Cleanup trap_init()
x86/idt: Use proper constants for table size
x86/idt: Add comments about early #PF handling
x86/idt: Mark init only functions __init
x86/entry: Rename trace_hardirqs_off_prepare()
x86/entry: Clarify irq_{enter,exit}_rcu()
x86/entry: Remove DBn stacks
x86/entry: Remove debug IDT frobbing
x86/entry: Optimize local_db_save() for virt
...
918 lines
21 KiB
C
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * KVM paravirt_ops implementation
 *
 * Copyright (C) 2007, Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
 * Copyright IBM Corporation, 2007
 *   Authors: Anthony Liguori <aliguori@us.ibm.com>
 */

#include <linux/context_tracking.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kvm_para.h>
#include <linux/cpu.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/hardirq.h>
#include <linux/notifier.h>
#include <linux/reboot.h>
#include <linux/hash.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kprobes.h>
#include <linux/nmi.h>
#include <linux/swait.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
#include <asm/desc.h>
#include <asm/tlbflush.h>
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/hypervisor.h>
#include <asm/tlb.h>
#include <asm/cpuidle_haltpoll.h>

DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);

static int kvmapf = 1;

static int __init parse_no_kvmapf(char *arg)
{
	kvmapf = 0;
	return 0;
}

early_param("no-kvmapf", parse_no_kvmapf);

static int steal_acc = 1;
static int __init parse_no_stealacc(char *arg)
{
	steal_acc = 0;
	return 0;
}

early_param("no-steal-acc", parse_no_stealacc);

static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible;
static int has_steal_clock = 0;

/*
 * No need for any "IO delay" on KVM
 */
static void kvm_io_delay(void)
{
}

#define KVM_TASK_SLEEP_HASHBITS 8
#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)

struct kvm_task_sleep_node {
	struct hlist_node link;
	struct swait_queue_head wq;
	u32 token;
	int cpu;
};

static struct kvm_task_sleep_head {
	raw_spinlock_t lock;
	struct hlist_head list;
} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];

static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
						  u32 token)
{
	struct hlist_node *p;

	hlist_for_each(p, &b->list) {
		struct kvm_task_sleep_node *n =
			hlist_entry(p, typeof(*n), link);
		if (n->token == token)
			return n;
	}

	return NULL;
}

static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
{
	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
	struct kvm_task_sleep_node *e;

	raw_spin_lock(&b->lock);
	e = _find_apf_task(b, token);
	if (e) {
		/* dummy entry exist -> wake up was delivered ahead of PF */
		hlist_del(&e->link);
		raw_spin_unlock(&b->lock);
		kfree(e);
		return false;
	}

	n->token = token;
	n->cpu = smp_processor_id();
	init_swait_queue_head(&n->wq);
	hlist_add_head(&n->link, &b->list);
	raw_spin_unlock(&b->lock);
	return true;
}

/*
 * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
 * @token:	Token to identify the sleep node entry
 *
 * Invoked from the async pagefault handling code or from the VM exit page
 * fault handler. In both cases RCU is watching.
 */
void kvm_async_pf_task_wait_schedule(u32 token)
{
	struct kvm_task_sleep_node n;
	DECLARE_SWAITQUEUE(wait);

	lockdep_assert_irqs_disabled();

	if (!kvm_async_pf_queue_task(token, &n))
		return;

	for (;;) {
		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
		if (hlist_unhashed(&n.link))
			break;

		local_irq_enable();
		schedule();
		local_irq_disable();
	}
	finish_swait(&n.wq, &wait);
}
EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);

static void apf_task_wake_one(struct kvm_task_sleep_node *n)
{
	hlist_del_init(&n->link);
	if (swq_has_sleeper(&n->wq))
		swake_up_one(&n->wq);
}

static void apf_task_wake_all(void)
{
	int i;

	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
		struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
		struct kvm_task_sleep_node *n;
		struct hlist_node *p, *next;

		raw_spin_lock(&b->lock);
		hlist_for_each_safe(p, next, &b->list) {
			n = hlist_entry(p, typeof(*n), link);
			if (n->cpu == smp_processor_id())
				apf_task_wake_one(n);
		}
		raw_spin_unlock(&b->lock);
	}
}

void kvm_async_pf_task_wake(u32 token)
{
	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
	struct kvm_task_sleep_node *n;

	if (token == ~0) {
		apf_task_wake_all();
		return;
	}

again:
	raw_spin_lock(&b->lock);
	n = _find_apf_task(b, token);
	if (!n) {
		/*
		 * async PF was not yet handled.
		 * Add dummy entry for the token.
		 */
		n = kzalloc(sizeof(*n), GFP_ATOMIC);
		if (!n) {
			/*
			 * Allocation failed! Busy wait while other cpu
			 * handles async PF.
			 */
			raw_spin_unlock(&b->lock);
			cpu_relax();
			goto again;
		}
		n->token = token;
		n->cpu = smp_processor_id();
		init_swait_queue_head(&n->wq);
		hlist_add_head(&n->link, &b->list);
	} else {
		apf_task_wake_one(n);
	}
	raw_spin_unlock(&b->lock);
	return;
}
EXPORT_SYMBOL_GPL(kvm_async_pf_task_wake);

noinstr u32 kvm_read_and_reset_apf_flags(void)
{
	u32 flags = 0;

	if (__this_cpu_read(apf_reason.enabled)) {
		flags = __this_cpu_read(apf_reason.flags);
		__this_cpu_write(apf_reason.flags, 0);
	}

	return flags;
}
EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);

noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
	u32 reason = kvm_read_and_reset_apf_flags();
	bool rcu_exit;

	switch (reason) {
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
	case KVM_PV_REASON_PAGE_READY:
		break;
	default:
		return false;
	}

	rcu_exit = idtentry_enter_cond_rcu(regs);
	instrumentation_begin();

	/*
	 * If the host managed to inject an async #PF into an interrupt
	 * disabled region, then die hard as this is not going to end well
	 * and the host side is seriously broken.
	 */
	if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
		panic("Host injected async #PF in interrupt disabled region\n");

	if (reason == KVM_PV_REASON_PAGE_NOT_PRESENT) {
		if (unlikely(!(user_mode(regs))))
			panic("Host injected async #PF in kernel mode\n");
		/* Page is swapped out by the host. */
		kvm_async_pf_task_wait_schedule(token);
	} else {
		kvm_async_pf_task_wake(token);
	}

	instrumentation_end();
	idtentry_exit_cond_rcu(regs, rcu_exit);
	return true;
}

static void __init paravirt_ops_setup(void)
{
	pv_info.name = "KVM";

	if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
		pv_ops.cpu.io_delay = kvm_io_delay;

#ifdef CONFIG_X86_IO_APIC
	no_timer_check = 1;
#endif
}

static void kvm_register_steal_time(void)
{
	int cpu = smp_processor_id();
	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

	if (!has_steal_clock)
		return;

	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
	pr_info("kvm-stealtime: cpu %d, msr %llx\n",
		cpu, (unsigned long long) slow_virt_to_phys(st));
}

static DEFINE_PER_CPU_DECRYPTED(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;

static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
{
	/**
	 * This relies on __test_and_clear_bit to modify the memory
	 * in a way that is atomic with respect to the local CPU.
	 * The hypervisor only accesses this memory from the local CPU so
	 * there's no need for lock or memory barriers.
	 * An optimization barrier is implied in apic write.
	 */
	if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
		return;
	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
}

static void kvm_guest_cpu_init(void)
{
	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
		u64 pa;

		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));

		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
		pa |= KVM_ASYNC_PF_ENABLED;

		if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
			pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;

		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
		__this_cpu_write(apf_reason.enabled, 1);
		pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
	}

	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
		unsigned long pa;

		/* Size alignment is implied but just to make it explicit. */
		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
		__this_cpu_write(kvm_apic_eoi, 0);
		pa = slow_virt_to_phys(this_cpu_ptr(&kvm_apic_eoi))
			| KVM_MSR_ENABLED;
		wrmsrl(MSR_KVM_PV_EOI_EN, pa);
	}

	if (has_steal_clock)
		kvm_register_steal_time();
}

static void kvm_pv_disable_apf(void)
{
	if (!__this_cpu_read(apf_reason.enabled))
		return;

	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
	__this_cpu_write(apf_reason.enabled, 0);

	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
}

static void kvm_pv_guest_cpu_reboot(void *unused)
{
	/*
	 * We disable PV EOI before we load a new kernel by kexec,
	 * since MSR_KVM_PV_EOI_EN stores a pointer into old kernel's memory.
	 * New kernel can re-enable when it boots.
	 */
	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
	kvm_pv_disable_apf();
	kvm_disable_steal_time();
}

static int kvm_pv_reboot_notify(struct notifier_block *nb,
				unsigned long code, void *unused)
{
	if (code == SYS_RESTART)
		on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
	return NOTIFY_DONE;
}

static struct notifier_block kvm_pv_reboot_nb = {
	.notifier_call = kvm_pv_reboot_notify,
};

static u64 kvm_steal_clock(int cpu)
{
	u64 steal;
	struct kvm_steal_time *src;
	int version;

	src = &per_cpu(steal_time, cpu);
	do {
		version = src->version;
		virt_rmb();
		steal = src->steal;
		virt_rmb();
	} while ((version & 1) || (version != src->version));

	return steal;
}

void kvm_disable_steal_time(void)
{
	if (!has_steal_clock)
		return;

	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

static inline void __set_percpu_decrypted(void *ptr, unsigned long size)
{
	early_set_memory_decrypted((unsigned long) ptr, size);
}

/*
 * Iterate through all possible CPUs and map the memory region pointed
 * by apf_reason, steal_time and kvm_apic_eoi as decrypted at once.
 *
 * Note: we iterate through all possible CPUs to ensure that CPUs
 * hotplugged will have their per-cpu variable already mapped as
 * decrypted.
 */
static void __init sev_map_percpu_data(void)
{
	int cpu;

	if (!sev_active())
		return;

	for_each_possible_cpu(cpu) {
		__set_percpu_decrypted(&per_cpu(apf_reason, cpu), sizeof(apf_reason));
		__set_percpu_decrypted(&per_cpu(steal_time, cpu), sizeof(steal_time));
		__set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu), sizeof(kvm_apic_eoi));
	}
}

static bool pv_tlb_flush_supported(void)
{
	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}

static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);

#ifdef CONFIG_SMP

static bool pv_ipi_supported(void)
{
	return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI);
}

static bool pv_sched_yield_supported(void)
{
	return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME));
}

#define KVM_IPI_CLUSTER_SIZE	(2 * BITS_PER_LONG)

static void __send_ipi_mask(const struct cpumask *mask, int vector)
{
	unsigned long flags;
	int cpu, apic_id, icr;
	int min = 0, max = 0;
#ifdef CONFIG_X86_64
	__uint128_t ipi_bitmap = 0;
#else
	u64 ipi_bitmap = 0;
#endif
	long ret;

	if (cpumask_empty(mask))
		return;

	local_irq_save(flags);

	switch (vector) {
	default:
		icr = APIC_DM_FIXED | vector;
		break;
	case NMI_VECTOR:
		icr = APIC_DM_NMI;
		break;
	}

	for_each_cpu(cpu, mask) {
		apic_id = per_cpu(x86_cpu_to_apicid, cpu);
		if (!ipi_bitmap) {
			min = max = apic_id;
		} else if (apic_id < min && max - apic_id < KVM_IPI_CLUSTER_SIZE) {
			ipi_bitmap <<= min - apic_id;
			min = apic_id;
		} else if (apic_id < min + KVM_IPI_CLUSTER_SIZE) {
			max = apic_id < max ? max : apic_id;
		} else {
			ret = kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
				(unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, icr);
			WARN_ONCE(ret < 0, "KVM: failed to send PV IPI: %ld", ret);
			min = max = apic_id;
			ipi_bitmap = 0;
		}
		__set_bit(apic_id - min, (unsigned long *)&ipi_bitmap);
	}

	if (ipi_bitmap) {
		ret = kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
			(unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, icr);
		WARN_ONCE(ret < 0, "KVM: failed to send PV IPI: %ld", ret);
	}

	local_irq_restore(flags);
}

static void kvm_send_ipi_mask(const struct cpumask *mask, int vector)
{
	__send_ipi_mask(mask, vector);
}

static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector)
{
	unsigned int this_cpu = smp_processor_id();
	struct cpumask *new_mask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
	const struct cpumask *local_mask;

	cpumask_copy(new_mask, mask);
	cpumask_clear_cpu(this_cpu, new_mask);
	local_mask = new_mask;
	__send_ipi_mask(local_mask, vector);
}

/*
 * Set the IPI entry points
 */
static void kvm_setup_pv_ipi(void)
{
	apic->send_IPI_mask = kvm_send_ipi_mask;
	apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
	pr_info("KVM setup pv IPIs\n");
}

static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
{
	int cpu;

	native_send_call_func_ipi(mask);

	/* Make sure other vCPUs get a chance to run if they need to. */
	for_each_cpu(cpu, mask) {
		if (vcpu_is_preempted(cpu)) {
			kvm_hypercall1(KVM_HC_SCHED_YIELD, per_cpu(x86_cpu_to_apicid, cpu));
			break;
		}
	}
}

static void __init kvm_smp_prepare_cpus(unsigned int max_cpus)
{
	native_smp_prepare_cpus(max_cpus);
	if (kvm_para_has_hint(KVM_HINTS_REALTIME))
		static_branch_disable(&virt_spin_lock_key);
}

static void __init kvm_smp_prepare_boot_cpu(void)
{
	/*
	 * Map the per-cpu variables as decrypted before kvm_guest_cpu_init()
	 * shares the guest physical address with the hypervisor.
	 */
	sev_map_percpu_data();

	kvm_guest_cpu_init();
	native_smp_prepare_boot_cpu();
	kvm_spinlock_init();
}

static void kvm_guest_cpu_offline(void)
{
	kvm_disable_steal_time();
	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
	kvm_pv_disable_apf();
	apf_task_wake_all();
}

static int kvm_cpu_online(unsigned int cpu)
{
	local_irq_disable();
	kvm_guest_cpu_init();
	local_irq_enable();
	return 0;
}

static int kvm_cpu_down_prepare(unsigned int cpu)
{
	local_irq_disable();
	kvm_guest_cpu_offline();
	local_irq_enable();
	return 0;
}
#endif

static void kvm_flush_tlb_others(const struct cpumask *cpumask,
			const struct flush_tlb_info *info)
{
	u8 state;
	int cpu;
	struct kvm_steal_time *src;
	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);

	cpumask_copy(flushmask, cpumask);
	/*
	 * We have to call flush only on online vCPUs. And
	 * queue flush_on_enter for pre-empted vCPUs
	 */
	for_each_cpu(cpu, flushmask) {
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED)) {
			if (try_cmpxchg(&src->preempted, &state,
					state | KVM_VCPU_FLUSH_TLB))
				__cpumask_clear_cpu(cpu, flushmask);
		}
	}

	native_flush_tlb_others(flushmask, info);
}

static void __init kvm_guest_init(void)
{
	int i;

	paravirt_ops_setup();
	register_reboot_notifier(&kvm_pv_reboot_nb);
	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
		raw_spin_lock_init(&async_pf_sleepers[i].lock);

	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
		has_steal_clock = 1;
		pv_ops.time.steal_clock = kvm_steal_clock;
	}

	if (pv_tlb_flush_supported()) {
		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
		pr_info("KVM setup pv remote TLB flush\n");
	}

	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
		apic_set_eoi_write(kvm_guest_apic_eoi_write);

	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf)
		static_branch_enable(&kvm_async_pf_enabled);

#ifdef CONFIG_SMP
	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
	if (pv_sched_yield_supported()) {
		smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi;
		pr_info("KVM setup pv sched yield\n");
	}
	if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "x86/kvm:online",
				      kvm_cpu_online, kvm_cpu_down_prepare) < 0)
		pr_err("kvm_guest: Failed to install cpu hotplug callbacks\n");
#else
	sev_map_percpu_data();
	kvm_guest_cpu_init();
#endif

	/*
	 * Hard lockup detection is enabled by default. Disable it, as guests
	 * can get false positives too easily, for example if the host is
	 * overcommitted.
	 */
	hardlockup_detector_disable();
}

static noinline uint32_t __kvm_cpuid_base(void)
{
	if (boot_cpu_data.cpuid_level < 0)
		return 0;	/* So we don't blow up on old processors */

	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);

	return 0;
}

static inline uint32_t kvm_cpuid_base(void)
{
	static int kvm_cpuid_base = -1;

	if (kvm_cpuid_base == -1)
		kvm_cpuid_base = __kvm_cpuid_base();

	return kvm_cpuid_base;
}

bool kvm_para_available(void)
{
	return kvm_cpuid_base() != 0;
}
EXPORT_SYMBOL_GPL(kvm_para_available);

unsigned int kvm_arch_para_features(void)
{
	return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
}

unsigned int kvm_arch_para_hints(void)
{
	return cpuid_edx(kvm_cpuid_base() | KVM_CPUID_FEATURES);
}
EXPORT_SYMBOL_GPL(kvm_arch_para_hints);

static uint32_t __init kvm_detect(void)
{
	return kvm_cpuid_base();
}

static void __init kvm_apic_init(void)
{
#if defined(CONFIG_SMP)
	if (pv_ipi_supported())
		kvm_setup_pv_ipi();
#endif
}

static void __init kvm_init_platform(void)
{
	kvmclock_init();
	x86_platform.apic_post_init = kvm_apic_init;
}

const __initconst struct hypervisor_x86 x86_hyper_kvm = {
	.name			= "KVM",
	.detect			= kvm_detect,
	.type			= X86_HYPER_KVM,
	.init.guest_late_init	= kvm_guest_init,
	.init.x2apic_available	= kvm_para_available,
	.init.init_platform	= kvm_init_platform,
};

static __init int activate_jump_labels(void)
{
	if (has_steal_clock) {
		static_key_slow_inc(&paravirt_steal_enabled);
		if (steal_acc)
			static_key_slow_inc(&paravirt_steal_rq_enabled);
	}

	return 0;
}
arch_initcall(activate_jump_labels);

static __init int kvm_alloc_cpumask(void)
{
	int cpu;
	bool alloc = false;

	if (!kvm_para_available() || nopv)
		return 0;

	if (pv_tlb_flush_supported())
		alloc = true;

#if defined(CONFIG_SMP)
	if (pv_ipi_supported())
		alloc = true;
#endif

	if (alloc)
		for_each_possible_cpu(cpu) {
			zalloc_cpumask_var_node(per_cpu_ptr(&__pv_cpu_mask, cpu),
				GFP_KERNEL, cpu_to_node(cpu));
		}

	return 0;
}
arch_initcall(kvm_alloc_cpumask);

#ifdef CONFIG_PARAVIRT_SPINLOCKS

/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
static void kvm_kick_cpu(int cpu)
{
	int apicid;
	unsigned long flags = 0;

	apicid = per_cpu(x86_cpu_to_apicid, cpu);
	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}

#include <asm/qspinlock.h>

static void kvm_wait(u8 *ptr, u8 val)
{
	unsigned long flags;

	if (in_nmi())
		return;

	local_irq_save(flags);

	if (READ_ONCE(*ptr) != val)
		goto out;

	/*
	 * halt until it's our turn and kicked. Note that we do safe halt
	 * for irq enabled case to avoid hang when lock info is overwritten
	 * in irq spinlock slowpath and no spurious interrupt occur to save us.
	 */
	if (arch_irqs_disabled_flags(flags))
		halt();
	else
		safe_halt();

out:
	local_irq_restore(flags);
}

#ifdef CONFIG_X86_32
__visible bool __kvm_vcpu_is_preempted(long cpu)
{
	struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

	return !!(src->preempted & KVM_VCPU_PREEMPTED);
}
PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);

#else

#include <asm/asm-offsets.h>

extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);

/*
 * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
 * restoring to/from the stack.
 */
asm(
".pushsection .text;"
".global __raw_callee_save___kvm_vcpu_is_preempted;"
".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
"__raw_callee_save___kvm_vcpu_is_preempted:"
"movq	__per_cpu_offset(,%rdi,8), %rax;"
"cmpb	$0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
"setne	%al;"
"ret;"
".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
".popsection");

#endif

/*
 * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
 */
void __init kvm_spinlock_init(void)
{
	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return;

	if (kvm_para_has_hint(KVM_HINTS_REALTIME))
		return;

	/* Don't use the pvqspinlock code if there is only 1 vCPU. */
	if (num_possible_cpus() == 1)
		return;

	__pv_init_lock_hash();
	pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
	pv_ops.lock.queued_spin_unlock =
		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
	pv_ops.lock.wait = kvm_wait;
	pv_ops.lock.kick = kvm_kick_cpu;

	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
		pv_ops.lock.vcpu_is_preempted =
			PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
	}
}

#endif	/* CONFIG_PARAVIRT_SPINLOCKS */

#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL

static void kvm_disable_host_haltpoll(void *i)
{
	wrmsrl(MSR_KVM_POLL_CONTROL, 0);
}

static void kvm_enable_host_haltpoll(void *i)
{
	wrmsrl(MSR_KVM_POLL_CONTROL, 1);
}

void arch_haltpoll_enable(unsigned int cpu)
{
	if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
		pr_err_once("kvm: host does not support poll control\n");
		pr_err_once("kvm: host upgrade recommended\n");
		return;
	}

	/* Enable guest halt poll disables host halt poll */
	smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
}
EXPORT_SYMBOL_GPL(arch_haltpoll_enable);

void arch_haltpoll_disable(unsigned int cpu)
{
	if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
		return;

	/* Enable guest halt poll disables host halt poll */
	smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
}
EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
#endif
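
The batching done by __send_ipi_mask() in the listing above can be
followed more easily in a small userspace model. The sketch below groups
a list of IDs into (bitmap, base) batches, where bit N of the bitmap
stands for ID base + N. It is a simplified illustration under stated
assumptions, not the kernel code: it uses a 64-bit window instead of
2 * BITS_PER_LONG, it records batches in an array instead of issuing a
hypercall, and the names demo_batch, demo_cluster_ids and
DEMO_CLUSTER_SIZE are hypothetical. The extra "id >= min" check in the
third branch is an assumption of this sketch so that the bit offset can
never become negative.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CLUSTER_SIZE 64	/* window width in bits (assumption) */

struct demo_batch {
	uint64_t bitmap;	/* bit N set means ID (base + N) is targeted */
	int base;		/* lowest ID covered by this batch */
};

/*
 * Group 'n' IDs into batches. Returns the number of batches produced;
 * at most 'max_out' of them are stored in 'out'.
 */
static size_t demo_cluster_ids(const int *ids, size_t n,
			       struct demo_batch *out, size_t max_out)
{
	uint64_t bitmap = 0;
	int min = 0, max = 0;
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		int id = ids[i];

		if (!bitmap) {
			/* Empty window: start a new one at this ID. */
			min = max = id;
		} else if (id < min && max - id < DEMO_CLUSTER_SIZE) {
			/* Extend the window downwards by shifting the bits up. */
			bitmap <<= min - id;
			min = id;
		} else if (id >= min && id < min + DEMO_CLUSTER_SIZE) {
			/* Fits into the current window. */
			if (id > max)
				max = id;
		} else {
			/* Does not fit: emit the window and start over. */
			if (used < max_out)
				out[used] = (struct demo_batch){ bitmap, min };
			used++;
			min = max = id;
			bitmap = 0;
		}
		bitmap |= UINT64_C(1) << (id - min);
	}

	/* Emit whatever is left in the last window. */
	if (bitmap) {
		if (used < max_out)
			out[used] = (struct demo_batch){ bitmap, min };
		used++;
	}

	return used;
}
```

For the input IDs 3, 1, 70, 200 the first two IDs share one batch with
base 1, while 70 and 200 each end up in a batch of their own because
they fall outside the 64-bit window of the preceding batch.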