Commit Graph

10246 Commits

Author SHA1 Message Date
Peter Zijlstra
c7aadc0932 x86/entry: Increase entry_stack size to a full page
Marco crashed in bad_iret with a Clang11/KCSAN build due to
overflowing the stack. Now that we run C code on it, expand it to a
full page.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200618144801.819246178@infradead.org
2020-06-25 13:45:40 +02:00
Linus Torvalds
26e122e97a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "All bugfixes except for a couple cleanup patches"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru
  KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling
  KVM: X86: Fix MSR range of APIC registers in X2APIC mode
  KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL
  KVM: nVMX: Plumb L2 GPA through to PML emulation
  KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic()
  KVM: LAPIC: ensure APIC map is up to date on concurrent update requests
  kvm: lapic: fix broken vcpu hotplug
  Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU"
  KVM: VMX: Add helpers to identify interrupt type from intr_info
  kvm/svm: disable KCSAN for svm_vcpu_run()
  KVM: MIPS: Fix a build error for !CPU_LOONGSON64
2020-06-23 11:01:16 -07:00
Sean Christopherson
bf09fb6cba KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL
Remove support for context switching between the guest's and host's
desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
required for correct functionality, e.g. KVM intercepts reads and writes
to the MSR, and the latency effects of the settings controlled by the
MSR are not architecturally visible.

As a general rule, KVM should not allow the guest to control power
management settings unless explicitly enabled by userspace, e.g. see
KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
can improve the performance of SMT siblings.  A devious guest could
disable C0.2 so as to improve the performance of their workloads at the
detriment to workloads running in the host or on other VMs.

Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
condition where updates from the host may cause KVM to enter the guest
with the incorrect value.  Because updates are are propagated to all
CPUs via IPI (SMP function callback), the value in hardware may be
stale with respect to the cached value and KVM could enter the guest
with the wrong value in hardware.  As above, the guest can't observe the
bad value, but it's a weird and confusing wart in the implementation.

Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
lists.  Using the lists is only necessary for MSRs that are required for
correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
old hardware, or for MSRs that need to-the-uop precision, e.g. perf
related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
expensive than direct RDMSR/WRMSR.

Furthermore, even if giving the guest control of the MSR is legitimate,
e.g. in pass-through scenarios, it's not clear that the benefits would
outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
roundtrip costs ~250 cycles, and if the guest diverged from the host
that cost would be paid on every run of the guest.  In other words, if
there is a legitimate use case then it should be enabled by a new
per-VM capability.

Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
UMWAIT and UMONITOR.

Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Cc: stable@vger.kernel.org
Cc: Jingqi Liu <jingqi.liu@intel.com>
Cc: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22 20:54:57 -04:00
Sean Christopherson
2dbebf7ae1 KVM: nVMX: Plumb L2 GPA through to PML emulation
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
intents and purposes is vmx_write_pml_buffer(), instead of having the
latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
update is the result of KVM emulation (rare for L2), then the GPA in the
VMCS may be stale and/or hold a completely unrelated GPA.

Fixes: c5f983f6e8 ("nVMX: Implement emulated Page Modification Logging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22 18:23:03 -04:00
Paolo Bonzini
44d5271707 KVM: LAPIC: ensure APIC map is up to date on concurrent update requests
The following race can cause lost map update events:

         cpu1                            cpu2

                                apic_map_dirty = true
  ------------------------------------------------------------
                                kvm_recalculate_apic_map:
                                     pass check
                                         mutex_lock(&kvm->arch.apic_map_lock);
                                         if (!kvm->arch.apic_map_dirty)
                                     and in process of updating map
  -------------------------------------------------------------
    other calls to
       apic_map_dirty = true         might be too late for affected cpu
  -------------------------------------------------------------
                                     apic_map_dirty = false
  -------------------------------------------------------------
    kvm_recalculate_apic_map:
    bail out on
      if (!kvm->arch.apic_map_dirty)

To fix it, record the beginning of an update of the APIC map in
apic_map_dirty.  If another APIC map change switches apic_map_dirty
back to DIRTY during the update, kvm_recalculate_apic_map should not
make it CLEAN, and the other caller will go through the slow path.

Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22 13:37:30 -04:00
Borislav Petkov
99e40204e0 x86/msr: Move the F15h MSRs where they belong
1068ed4547 ("x86/msr: Lift AMD family 0x15 power-specific MSRs")

moved the three F15h power MSRs to the architectural list but that was
wrong as they belong in the family 0x15 list. That also caused:

  In file included from trace/beauty/tracepoints/x86_msr.c:10:
  perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: error: initialized field overwritten [-Werror=override-init]
    292 |  [0xc0010280 - x86_AMD_V_KVM_MSRs_offset] = "F15H_PTSC",
        |                                             ^~~~~~~~~~~
  perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: note: (near initialization for 'x86_AMD_V_KVM_MSRs[640]')

due to MSR_F15H_PTSC ending up being defined twice. Move them where they
belong and drop the duplicate.

Also, drop the respective tools/ changes of the msr-index.h copy the
above commit added because perf tool developers prefer to go through
those changes themselves in order to figure out whether changes to the
kernel headers would need additional handling in perf.

Fixes: 1068ed4547 ("x86/msr: Lift AMD family 0x15 power-specific MSRs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lkml.kernel.org/r/20200621163323.14e8533f@canb.auug.org.au
2020-06-22 17:15:53 +02:00
Benjamin Thiel
56ce93700e x86/mm/32: Fix -Wmissing prototypes warnings for init.c
Fix:

  arch/x86/mm/init.c:503:21:
  warning: no previous prototype for ‘init_memory_mapping’ [-Wmissing-prototypes]
  unsigned long __ref init_memory_mapping(unsigned long start,

  arch/x86/mm/init.c:745:13:
  warning: no previous prototype for ‘poking_init’ [-Wmissing-prototypes]
  void __init poking_init(void)

Lift init_memory_mapping() and poking_init() out of the ifdef
CONFIG_X86_64 to make the functions visible on 32-bit too.

Signed-off-by: Benjamin Thiel <b.thiel@posteo.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200606123743.3277-1-b.thiel@posteo.de
2020-06-18 18:04:00 +02:00
Andi Kleen
742c45c3ec x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
The kernel needs to explicitly enable FSGSBASE. So, the application needs
to know if it can safely use these instructions. Just looking at the CPUID
bit is not enough because it may be running in a kernel that does not
enable the instructions.

One way for the application would be to just try and catch the SIGILL.
But that is difficult to do in libraries which may not want to overwrite
the signal handlers of the main application.

Enumerate the enabled FSGSBASE capability in bit 1 of AT_HWCAP2 in the ELF
aux vector. AT_HWCAP2 is already used by PPC for similar purposes.

The application can access it open coded or by using the getauxval()
function in newer versions of glibc.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1557309753-24073-18-git-send-email-chang.seok.bae@intel.com
Link: https://lkml.kernel.org/r/20200528201402.1708239-14-sashal@kernel.org
2020-06-18 15:47:05 +02:00
Chang S. Bae
eaad981291 x86/entry/64: Introduce the FIND_PERCPU_BASE macro
GSBASE is used to find per-CPU data in the kernel. But when GSBASE is
unknown, the per-CPU base can be found from the per_cpu_offset table with a
CPU NR.  The CPU NR is extracted from the limit field of the CPUNODE entry
in GDT, or by the RDPID instruction. This is a prerequisite for using
FSGSBASE in the low level entry code.

Also, add the GAS-compatible RDPID macro as binutils 2.23 do not support
it. Support is added in version 2.27.

[ tglx: Massaged changelog ]

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1557309753-24073-12-git-send-email-chang.seok.bae@intel.com
Link: https://lkml.kernel.org/r/20200528201402.1708239-11-sashal@kernel.org
2020-06-18 15:47:04 +02:00
Thomas Gleixner
6758034e4d x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
save_fsgs_for_kvm() is invoked via

  vcpu_enter_guest()
    kvm_x86_ops.prepare_guest_switch(vcpu)
      vmx_prepare_switch_to_guest()
        save_fsgs_for_kvm()

with preemption disabled, but interrupts enabled.

The upcoming FSGSBASE based GS safe needs interrupts to be disabled. This
could be done in the helper function, but that function is also called from
switch_to() which has interrupts disabled already.

Disable interrupts inside save_fsgs_for_kvm() and rename the function to
current_save_fsgs() so it can be invoked from other places.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org
2020-06-18 15:47:01 +02:00
Chang S. Bae
58edfd2e0a x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
Add cpu feature conditional FSGSBASE access to the relevant helper
functions. That allows to accelerate certain FS/GS base operations in
subsequent changes.

Note, that while possible, the user space entry/exit GSBASE operations are
not going to use the new FSGSBASE instructions. The reason is that it would
require additional storage for the user space value which adds more
complexity to the low level code and experiments have shown marginal
benefit. This may be revisited later but for now the SWAPGS based handling
in the entry code is preserved except for the paranoid entry/exit code.

To preserve the SWAPGS entry mechanism introduce __[rd|wr]gsbase_inactive()
helpers. Note, for Xen PV, paravirt hooks can be added later as they might
allow a very efficient but different implementation.

[ tglx: Massaged changelog, convert it to noinstr and force inline
  	native_swapgs() ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1557309753-24073-7-git-send-email-chang.seok.bae@intel.com
Link: https://lkml.kernel.org/r/20200528201402.1708239-5-sashal@kernel.org
2020-06-18 15:47:00 +02:00
Andi Kleen
b15378ca50 x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
[ luto: Rename the variables from FS and GS to FSBASE and GSBASE and
  make <asm/fsgsbase.h> safe to include on 32-bit kernels. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1557309753-24073-6-git-send-email-chang.seok.bae@intel.com
Link: https://lkml.kernel.org/r/20200528201402.1708239-4-sashal@kernel.org
2020-06-18 15:47:00 +02:00
Brian Gerst
c9a1ff316b x86/stackprotector: Pre-initialize canary for secondary CPUs
The idle tasks created for each secondary CPU already have a random stack
canary generated by fork().  Copy the canary to the percpu variable before
starting the secondary CPU which removes the need to call
boot_init_stack_canary().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200617225624.799335-1-brgerst@gmail.com
2020-06-18 13:09:17 +02:00
Christoph Hellwig
fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Benjamin Thiel
d5249bc7a1 x86/mm: Fix -Wmissing-prototypes warnings for arch/x86/mm/init.c
Fix -Wmissing-prototypes warnings:

  arch/x86/mm/init.c:81:6:
  warning: no previous prototype for ‘x86_has_pat_wp’ [-Wmissing-prototypes]
  bool x86_has_pat_wp(void)

  arch/x86/mm/init.c:86:22:
  warning: no previous prototype for ‘pgprot2cachemode’ [-Wmissing-prototypes]
  enum page_cache_mode pgprot2cachemode(pgprot_t pgprot)

by including the respective header containing prototypes. Also fix:

  arch/x86/mm/init.c:893:13:
  warning: no previous prototype for ‘mem_encrypt_free_decrypted_mem’ [-Wmissing-prototypes]
  void __weak mem_encrypt_free_decrypted_mem(void) { }

by making it static inline for the !CONFIG_AMD_MEM_ENCRYPT case. This
warning happens when CONFIG_AMD_MEM_ENCRYPT is not enabled (defconfig
for example):

  ./arch/x86/include/asm/mem_encrypt.h:80:27:
  warning: inline function ‘mem_encrypt_free_decrypted_mem’ declared weak [-Wattributes]
  static inline void __weak mem_encrypt_free_decrypted_mem(void) { }
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's ok to convert to static inline because the function is used only in
x86. Is not shared with other architectures so drop the __weak too.

 [ bp: Massage and adjust __weak comments while at it. ]

Signed-off-by: Benjamin Thiel <b.thiel@posteo.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200606122629.2720-1-b.thiel@posteo.de
2020-06-17 10:45:46 +02:00
Borislav Petkov
28b60197b5 x86/asm: Unify __ASSEMBLY__ blocks
Merge the two ifndef __ASSEMBLY__ blocks.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200604133204.7636-1-bp@alien8.de
2020-06-15 19:29:36 +02:00
Borislav Petkov
fbd5969d1f x86/cpufeatures: Mark two free bits in word 3
... so that they get reused when needed.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200604104150.2056-1-bp@alien8.de
2020-06-15 19:26:23 +02:00
Borislav Petkov
1068ed4547 x86/msr: Lift AMD family 0x15 power-specific MSRs
... into the global msr-index.h header because they're used in multiple
compilation units. Sort the MSR list a bit. Update the msr-index.h copy
in tools.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lkml.kernel.org/r/20200608164847.14232-1-bp@alien8.de
2020-06-15 19:25:53 +02:00
Sean Christopherson
5d5103595e x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup
Reinitialize IA32_FEAT_CTL on the BSP during wakeup to handle the case
where firmware doesn't initialize or save/restore across S3.  This fixes
a bug where IA32_FEAT_CTL is left uninitialized and results in VMXON
taking a #GP due to VMX not being fully enabled, i.e. breaks KVM.

Use init_ia32_feat_ctl() to "restore" IA32_FEAT_CTL as it already deals
with the case where the MSR is locked, and because APs already redo
init_ia32_feat_ctl() during suspend by virtue of the SMP boot flow being
used to reinitialize APs upon wakeup.  Do the call in the early wakeup
flow to avoid dependencies in the syscore_ops chain, e.g. simply adding
a resume hook is not guaranteed to work, as KVM does VMXON in its own
resume hook, kvm_resume(), when KVM has active guests.

Fixes: 21bd3467a5 ("KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR")
Reported-by: Brad Campbell <lists2009@fnarfbargle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Brad Campbell <lists2009@fnarfbargle.com>
Cc: stable@vger.kernel.org # v5.6
Link: https://lkml.kernel.org/r/20200608174134.11157-1-sean.j.christopherson@intel.com
2020-06-15 14:18:37 +02:00
Peter Zijlstra
8e8bb06d19 x86/entry, bug: Comment the instrumentation_begin() usage for WARN()
Explain the rationale for annotating WARN(), even though, strictly
speaking printk() and friends are very much not safe in many of the
places we put them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-06-15 14:10:10 +02:00
Peter Zijlstra
14d3b376b6 x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline()
vmlinux.o: warning: objtool: exc_nmi()+0x12: call to cpumask_test_cpu.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: mce_check_crashing_cpu()+0x12: call to cpumask_test_cpu.constprop.0()leaves .noinstr.text section

  cpumask_test_cpu()
    test_bit()
      instrument_atomic_read()
      arch_test_bit()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-06-15 14:10:09 +02:00
Peter Zijlstra
e825873366 x86, kcsan: Remove __no_kcsan_or_inline usage
Now that KCSAN relies on -tsan-distinguish-volatile we no longer need
the annotation for constant_test_bit(). Remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-06-15 14:10:08 +02:00
Giovanni Gherdovich
e2b0d619b4 x86, sched: check for counters overflow in frequency invariant accounting
The product mcnt * arch_max_freq_ratio can overflows u64.

For context, a large value for arch_max_freq_ratio would be 5000,
corresponding to a turbo_freq/base_freq ratio of 5 (normally it's more like
1500-2000). A large increment frequency for the MPERF counter would be 5GHz
(the base clock of all CPUs on the market today is less than that). With
these figures, a CPU would need to go without a scheduler tick for around 8
days for the u64 overflow to happen. It is unlikely, but the check is
warranted.

Under similar conditions, the difference acnt of two consecutive APERF
readings can overflow as well.

In these circumstances is appropriate to disable frequency invariant
accounting: the feature relies on measures of the clock frequency done at
every scheduler tick, which need to be "fresh" to be at all meaningful.

A note on i386: prior to version 5.1, the GCC compiler didn't have the
builtin function __builtin_mul_overflow. In these GCC versions the macro
check_mul_overflow needs __udivdi3() to do (u64)a/b, which the kernel
doesn't provide. For this reason this change fails to build on i386 if
GCC<5.1, and we protect the entire frequency invariant code behind
CONFIG_X86_64 (special thanks to "kbuild test robot" <lkp@intel.com>).

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-2-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Oleg Nesterov
3dc167ba57 sched/cputime: Improve cputime_adjust()
People report that utime and stime from /proc/<pid>/stat become very
wrong when the numbers are big enough, especially if you watch these
counters incrementally.

Specifically, the current implementation of: stime*rtime/total,
results in a saw-tooth function on top of the desired line, where the
teeth grow in size the larger the values become. IOW, it has a
relative error.

The result is that, when watching incrementally as time progresses
(for large values), we'll see periods of pure stime or utime increase,
irrespective of the actual ratio we're striving for.

Replace scale_stime() with a math64.h helper: mul_u64_u64_div_u64()
that is far more accurate. This also allows architectures to override
the implementation -- for instance they can opt for the old algorithm
if this new one turns out to be too expensive for them.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200519172506.GA317395@hirez.programming.kicks-ass.net
2020-06-15 14:10:00 +02:00
Adrian Hunter
3e46bb40af perf/x86: Add perf text poke events for kprobes
Add perf text poke events for kprobes. That includes:

 - the replaced instruction(s) which are executed out-of-line
   i.e. arch_copy_kprobe() and arch_remove_kprobe()

 - the INT3 that activates the kprobe
   i.e. arch_arm_kprobe() and arch_disarm_kprobe()

 - optimised kprobe function
   i.e. arch_prepare_optimized_kprobe() and
      __arch_remove_optimized_kprobe()

 - optimised kprobe
   i.e. arch_optimize_kprobes() and arch_unoptimize_kprobe()

Resulting in 8 possible text_poke events:

 0:  NULL -> probe.ainsn.insn (if ainsn.boostable && !kp.post_handler)
					arch_copy_kprobe()

 1:  old0 -> INT3			arch_arm_kprobe()

 // boosted kprobe active

 2:  NULL -> optprobe_trampoline	arch_prepare_optimized_kprobe()

 3:  INT3,old1,old2,old3,old4 -> JMP32	arch_optimize_kprobes()

 // optprobe active

 4:  JMP32 -> INT3,old1,old2,old3,old4

 // optprobe disabled and kprobe active (this sometimes goes back to 3)
					arch_unoptimize_kprobe()

 5:  optprobe_trampoline -> NULL	arch_remove_optimized_kprobe()

 // boosted kprobe active

 6:  INT3 -> old0			arch_disarm_kprobe()

 7:  probe.ainsn.insn -> NULL (if ainsn.boostable && !kp.post_handler)
					arch_remove_kprobe()

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-6-adrian.hunter@intel.com
2020-06-15 14:09:49 +02:00
Vitaly Kuznetsov
b1d405751c KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery
KVM now supports using interrupt for 'page ready' APF event delivery and
legacy mechanism was deprecated. Switch KVM guests to the new one.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-9-vkuznets@redhat.com>
[Use HYPERVISOR_CALLBACK_VECTOR instead of a separate vector. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-15 07:46:49 -04:00
Linus Torvalds
a9429089d3 Merge tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Thomas Gleixner:
 "RAS updates from Borislav Petkov:

   - Unmap a whole guest page if an MCE is encountered in it to avoid
     follow-on MCEs leading to the guest crashing, by Tony Luck.

     This change collided with the entry changes and the merge
     resolution would have been rather unpleasant. To avoid that the
     entry branch was merged in before applying this. The resulting code
     did not change over the rebase.

   - AMD MCE error thresholding machinery cleanup and hotplug
     sanitization, by Thomas Gleixner.

   - Change the MCE notifiers to denote whether they have handled the
     error and not break the chain early by returning NOTIFY_STOP, thus
     giving the opportunity for the later handlers in the chain to see
     it. By Tony Luck.

   - Add AMD family 0x17, models 0x60-6f support, by Alexander Monakov.

   - Last but not least, the usual round of fixes and improvements"

* tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()
  x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
  EDAC/amd64: Add AMD family 17h model 60h PCI IDs
  hwmon: (k10temp) Add AMD family 17h model 60h PCI match
  x86/amd_nb: Add AMD family 17h model 60h PCI IDs
  x86/mcelog: Add compat_ioctl for 32-bit mcelog support
  x86/mce: Drop bogus comment about mce.kflags
  x86/mce: Fixup exception only for the correct MCEs
  EDAC: Drop the EDAC report status checks
  x86/mce: Add mce=print_all option
  x86/mce: Change default MCE logger to check mce->kflags
  x86/mce: Fix all mce notifiers to update the mce->kflags bitmask
  x86/mce: Add a struct mce.kflags field
  x86/mce: Convert the CEC to use the MCE notifier
  x86/mce: Rename "first" function as "early"
  x86/mce/amd, edac: Remove report_gart_errors
  x86/mce/amd: Make threshold bank setting hotplug robust
  x86/mce/amd: Cleanup threshold device remove path
  x86/mce/amd: Straighten CPU hotplug path
  x86/mce/amd: Sanitize thresholding device creation hotplug path
  ...
2020-06-13 10:21:00 -07:00
Linus Torvalds
076f14be7f Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Thomas Gleixner:
 "The x86 entry, exception and interrupt code rework

  This all started about 6 month ago with the attempt to move the Posix
  CPU timer heavy lifting out of the timer interrupt code and just have
  lockless quick checks in that code path. Trivial 5 patches.

  This unearthed an inconsistency in the KVM handling of task work and
  the review requested to move all of this into generic code so other
  architectures can share.

  Valid request and solved with another 25 patches but those unearthed
  inconsistencies vs. RCU and instrumentation.

  Digging into this made it obvious that there are quite some
  inconsistencies vs. instrumentation in general. The int3 text poke
  handling in particular was completely unprotected and with the batched
  update of trace events even more likely to expose to endless int3
  recursion.

  In parallel the RCU implications of instrumenting fragile entry code
  came up in several discussions.

  The conclusion of the x86 maintainer team was to go all the way and
  make the protection against any form of instrumentation of fragile and
  dangerous code pathes enforcable and verifiable by tooling.

  A first batch of preparatory work hit mainline with commit
  d5f744f9a2 ("Pull x86 entry code updates from Thomas Gleixner")

  That (almost) full solution introduced a new code section
  '.noinstr.text' into which all code which needs to be protected from
  instrumentation of all sorts goes into. Any call into instrumentable
  code out of this section has to be annotated. objtool has support to
  validate this.

  Kprobes now excludes this section fully which also prevents BPF from
  fiddling with it and all 'noinstr' annotated functions also keep
  ftrace off. The section, kprobes and objtool changes are already
  merged.

  The major changes coming with this are:

    - Preparatory cleanups

    - Annotating of relevant functions to move them into the
      noinstr.text section or enforcing inlining by marking them
      __always_inline so the compiler cannot misplace or instrument
      them.

    - Splitting and simplifying the idtentry macro maze so that it is
      now clearly separated into simple exception entries and the more
      interesting ones which use interrupt stacks and have the paranoid
      handling vs. CR3 and GS.

    - Move quite some of the low level ASM functionality into C code:

       - enter_from and exit to user space handling. The ASM code now
         calls into C after doing the really necessary ASM handling and
         the return path goes back out without bells and whistels in
         ASM.

       - exception entry/exit got the equivivalent treatment

       - move all IRQ tracepoints from ASM to C so they can be placed as
         appropriate which is especially important for the int3
         recursion issue.

    - Consolidate the declaration and definition of entry points between
      32 and 64 bit. They share a common header and macros now.

    - Remove the extra device interrupt entry maze and just use the
      regular exception entry code.

    - All ASM entry points except NMI are now generated from the shared
      header file and the corresponding macros in the 32 and 64 bit
      entry ASM.

    - The C code entry points are consolidated as well with the help of
      DEFINE_IDTENTRY*() macros. This allows to ensure at one central
      point that all corresponding entry points share the same
      semantics. The actual function body for most entry points is in an
      instrumentable and sane state.

      There are special macros for the more sensitive entry points, e.g.
      INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
      They allow to put the whole entry instrumentation and RCU handling
      into safe places instead of the previous pray that it is correct
      approach.

    - The INT3 text poke handling is now completely isolated and the
      recursion issue banned. Aside of the entry rework this required
      other isolation work, e.g. the ability to force inline bsearch.

    - Prevent #DB on fragile entry code, entry relevant memory and
      disable it on NMI, #MC entry, which allowed to get rid of the
      nested #DB IST stack shifting hackery.

    - A few other cleanups and enhancements which have been made
      possible through this and already merged changes, e.g.
      consolidating and further restricting the IDT code so the IDT
      table becomes RO after init which removes yet another popular
      attack vector

    - About 680 lines of ASM maze are gone.

  There are a few open issues:

   - An escape out of the noinstr section in the MCE handler which needs
     some more thought but under the aspect that MCE is a complete
     trainwreck by design and the propability to survive it is low, this
     was not high on the priority list.

   - Paravirtualization

     When PV is enabled then objtool complains about a bunch of indirect
     calls out of the noinstr section. There are a few straight forward
     ways to fix this, but the other issues vs. general correctness were
     more pressing than parawitz.

   - KVM

     KVM is inconsistent as well. Patches have been posted, but they
     have not yet been commented on or picked up by the KVM folks.

   - IDLE

     Pretty much the same problems can be found in the low level idle
     code especially the parts where RCU stopped watching. This was
     beyond the scope of the more obvious and exposable problems and is
     on the todo list.

  The lesson learned from this brain melting exercise to morph the
  evolved code base into something which can be validated and understood
  is that once again the violation of the most important engineering
  principle "correctness first" has caused quite a few people to spend
  valuable time on problems which could have been avoided in the first
  place. The "features first" tinkering mindset really has to stop.

  With that I want to say thanks to everyone involved in contributing to
  this effort. Special thanks go to the following people (alphabetical
  order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
  Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
  Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra,
  Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
  x86/entry: Force rcu_irq_enter() when in idle task
  x86/entry: Make NMI use IDTENTRY_RAW
  x86/entry: Treat BUG/WARN as NMI-like entries
  x86/entry: Unbreak __irqentry_text_start/end magic
  x86/entry: __always_inline CR2 for noinstr
  lockdep: __always_inline more for noinstr
  x86/entry: Re-order #DB handler to avoid *SAN instrumentation
  x86/entry: __always_inline arch_atomic_* for noinstr
  x86/entry: __always_inline irqflags for noinstr
  x86/entry: __always_inline debugreg for noinstr
  x86/idt: Consolidate idt functionality
  x86/idt: Cleanup trap_init()
  x86/idt: Use proper constants for table size
  x86/idt: Add comments about early #PF handling
  x86/idt: Mark init only functions __init
  x86/entry: Rename trace_hardirqs_off_prepare()
  x86/entry: Clarify irq_{enter,exit}_rcu()
  x86/entry: Remove DBn stacks
  x86/entry: Remove debug IDT frobbing
  x86/entry: Optimize local_db_save() for virt
  ...
2020-06-13 10:05:47 -07:00
Linus Torvalds
52cd0d972f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
 "The guest side of the asynchronous page fault work has been delayed to
  5.9 in order to sync with Thomas's interrupt entry rework, but here's
  the rest of the KVM updates for this merge window.

  MIPS:
   - Loongson port

  PPC:
   - Fixes

  ARM:
   - Fixes

  x86:
   - KVM_SET_USER_MEMORY_REGION optimizations
   - Fixes
   - Selftest fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits)
  KVM: x86: do not pass poisoned hva to __kvm_set_memory_region
  KVM: selftests: fix sync_with_host() in smm_test
  KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
  KVM: async_pf: Cleanup kvm_setup_async_pf()
  kvm: i8254: remove redundant assignment to pointer s
  KVM: x86: respect singlestep when emulating instruction
  KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported
  KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check
  KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
  KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h
  KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
  KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
  KVM: arm64: Remove host_cpu_context member from vcpu structure
  KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr
  KVM: arm64: Handle PtrAuth traps early
  KVM: x86: Unexport x86_fpu_cache and make it static
  KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K
  KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
  KVM: arm64: Stop save/restoring ACTLR_EL1
  KVM: arm64: Add emulation for 32bit guests accessing ACTLR2
  ...
2020-06-12 11:05:52 -07:00
Thomas Gleixner
71ed49d8fb x86/entry: Make NMI use IDTENTRY_RAW
For no reason other than beginning brainmelt, IDTENTRY_NMI was mapped to
IDTENTRY_IST.

This is not a problem on 64bit because the IST default entry point maps to
IDTENTRY_RAW which does not any entry handling. The surplus function
declaration for the noist C entry point is unused and as there is no ASM
code emitted for NMI this went unnoticed.

On 32bit IDTENTRY_IST maps to a regular IDTENTRY which does the normal
entry handling. That is clearly the wrong thing to do for NMI.

Map it to IDTENTRY_RAW to unbreak it. The IDTENTRY_NMI mapping needs to
stay to avoid emitting ASM code.

Fixes: 6271fef00b ("x86/entry: Convert NMI to IDTENTRY_NMI")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Debugged-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/CA+G9fYvF3cyrY+-iw_SZtpN-i2qA2BruHg4M=QYECU2-dNdsMw@mail.gmail.com
2020-06-12 14:15:48 +02:00
Andy Lutomirski
15a416e8aa x86/entry: Treat BUG/WARN as NMI-like entries
BUG/WARN are cleverly optimized using UD2 to handle the BUG/WARN out of
line in an exception fixup.

But if BUG or WARN is issued in a funny RCU context, then the
idtentry_enter...() path might helpfully WARN that the RCU context is
invalid, which results in infinite recursion.

Split the BUG/WARN handling into an nmi_enter()/nmi_exit() path in
exc_invalid_op() to increase the chance to survive the experience.

[ tglx: Make the declaration match the implementation ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f8fe40e0088749734b4435b554f73eee53dcf7a8.1591932307.git.luto@kernel.org
2020-06-12 12:12:57 +02:00
Linus Torvalds
b791d1bdf9 Merge tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull the Kernel Concurrency Sanitizer from Thomas Gleixner:
 "The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector,
  which relies on compile-time instrumentation, and uses a
  watchpoint-based sampling approach to detect races.

  The feature was under development for quite some time and has already
  found legitimate bugs.

  Unfortunately it comes with a limitation, which was only understood
  late in the development cycle:

     It requires an up to date CLANG-11 compiler

  CLANG-11 is not yet released (scheduled for June), but it's the only
  compiler today which handles the kernel requirements and especially
  the annotations of functions to exclude them from KCSAN
  instrumentation correctly.

  These annotations really need to work so that low level entry code and
  especially int3 text poke handling can be completely isolated.

  A detailed discussion of the requirements and compiler issues can be
  found here:

    https://lore.kernel.org/lkml/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com/

  We came to the conclusion that trying to work around compiler
  limitations and bugs again would end up in a major trainwreck, so
  requiring a working compiler seemed to be the best choice.

  For Continous Integration purposes the compiler restriction is
  manageable and that's where most xxSAN reports come from.

  For a change this limitation might make GCC people actually look at
  their bugs. Some issues with CSAN in GCC are 7 years old and one has
  been 'fixed' 3 years ago with a half baken solution which 'solved' the
  reported issue but not the underlying problem.

  The KCSAN developers also ponder to use a GCC plugin to become
  independent, but that's not something which will show up in a few
  days.

  Blocking KCSAN until wide spread compiler support is available is not
  a really good alternative because the continuous growth of lockless
  optimizations in the kernel demands proper tooling support"

* tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of CONFIG_KASAN to decide inlining
  compiler.h: Move function attributes to compiler_types.h
  compiler.h: Avoid nested statement expression in data_race()
  compiler.h: Remove data_race() and unnecessary checks from {READ,WRITE}_ONCE()
  kcsan: Update Documentation to change supported compilers
  kcsan: Remove 'noinline' from __no_kcsan_or_inline
  kcsan: Pass option tsan-instrument-read-before-write to Clang
  kcsan: Support distinguishing volatile accesses
  kcsan: Restrict supported compilers
  kcsan: Avoid inserting __tsan_func_entry/exit if possible
  ubsan, kcsan: Don't combine sanitizer with kcov on clang
  objtool, kcsan: Add kcsan_disable_current() and kcsan_enable_current_nowarn()
  kcsan: Add __kcsan_{enable,disable}_current() variants
  checkpatch: Warn about data_race() without comment
  kcsan: Use GFP_ATOMIC under spin lock
  Improve KCSAN documentation a bit
  kcsan: Make reporting aware of KCSAN tests
  kcsan: Fix function matching in report
  kcsan: Change data_race() to no longer require marking racing accesses
  kcsan: Move kcsan_{disable,enable}_current() to kcsan-checks.h
  ...
2020-06-11 18:55:43 -07:00
Linus Torvalds
9716e57a01 Merge tag 'locking-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull atomics rework from Thomas Gleixner:
 "Peter Zijlstras rework of atomics and fallbacks. This solves two
  problems:

   1) Compilers uninline small atomic_* static inline functions which
      can expose them to instrumentation.

   2) The instrumentation of atomic primitives was done at the
      architecture level while composites or fallbacks were provided at
      the generic level. As a result there are no uninstrumented
      variants of the fallbacks.

  Both issues were in the way of fully isolating fragile entry code
  pathes and especially the text poke int3 handler which is prone to an
  endless recursion problem when anything in that code path is about to
  be instrumented. This was always a problem, but got elevated due to
  the new batch mode updates of tracing.

  The solution is to mark the functions __always_inline and to flip the
  fallback and instrumentation so the non-instrumented variants are at
  the architecture level and the instrumentation is done in generic
  code.

  The latter introduces another fallback variant which will go away once
  all architectures have been moved over to arch_atomic_*"

* tag 'locking-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/atomics: Flip fallbacks and instrumentation
  asm-generic/atomic: Use __always_inline for fallback wrappers
2020-06-11 18:27:19 -07:00
Linus Torvalds
6a45a65888 Merge tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more x86 updates from Thomas Gleixner:
 "A set of fixes and updates for x86:

   - Unbreak paravirt VDSO clocks.

     While the VDSO code was moved into lib for sharing a subtle check
     for the validity of paravirt clocks got replaced. While the
     replacement works perfectly fine for bare metal as the update of
     the VDSO clock mode is synchronous, it fails for paravirt clocks
     because the hypervisor can invalidate them asynchronously.

     Bring it back as an optional function so it does not inflict this
     on architectures which are free of PV damage.

   - Fix the jiffies to jiffies64 mapping on 64bit so it does not
     trigger an ODR violation on newer compilers

   - Three fixes for the SSBD and *IB* speculation mitigation maze to
     ensure consistency, not disabling of some *IB* variants wrongly and
     to prevent a rogue cross process shutdown of SSBD. All marked for
     stable.

   - Add yet more CPU models to the splitlock detection capable list
     !@#%$!

   - Bring the pr_info() back which tells that TSC deadline timer is
     enabled.

   - Reboot quirk for MacBook6,1"

* tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Unbreak paravirt VDSO clocks
  lib/vdso: Provide sanity check for cycles (again)
  clocksource: Remove obsolete ifdef
  x86_64: Fix jiffies ODR violation
  x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
  x86/speculation: Prevent rogue cross-process SSBD shutdown
  x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
  x86/cpu: Add Sapphire Rapids CPU model number
  x86/split_lock: Add Icelake microserver and Tigerlake CPU models
  x86/apic: Make TSC deadline timer detection message visible
  x86/reboot/quirks: Add MacBook6,1 reboot quirk
2020-06-11 15:54:31 -07:00
Thomas Gleixner
37d1a04b13 Rebase locking/kcsan to locking/urgent
Merge the state of the locking kcsan branch before the read/write_once()
and the atomics modifications got merged.

Squash the fallout of the rebase on top of the read/write once and atomic
fallback work into the merge. The history of the original branch is
preserved in tag locking-kcsan-2020-06-02.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11 20:02:46 +02:00
Vitaly Kuznetsov
2a18b7e7cd KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
'Page not present' event may or may not get injected depending on
guest's state. If the event wasn't injected, there is no need to
inject the corresponding 'page ready' event as the guest may get
confused. E.g. Linux thinks that the corresponding 'page not present'
event wasn't delivered *yet* and allocates a 'dummy entry' for it.
This entry is never freed.

Note, 'wakeup all' events have no corresponding 'page not present'
event and always get injected.

s390 seems to always be able to inject 'page not present', the
change is effectively a nop.

Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200610175532.779793-2-vkuznets@redhat.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11 12:35:19 -04:00
Tony Luck
17fae1294a x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
    set_mce_nospec
         set_memory_uc
              _set_memory_uc
                   change_page_attr_set_clr
                        cpa_flush
                             clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.

Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

  [ bp: Adjust to x86/entry changes. ]

Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
2020-06-11 15:19:17 +02:00
Thomas Gleixner
f77d26a9fc Merge branch 'x86/entry' into ras/core
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
2020-06-11 15:17:57 +02:00
Thomas Gleixner
f0178fc01f x86/entry: Unbreak __irqentry_text_start/end magic
The entry rework moved interrupt entry code from the irqentry to the
noinstr section which made the irqentry section empty.

This breaks boundary checks which rely on the __irqentry_text_start/end
markers to find out whether a function in a stack trace is
interrupt/exception entry code. This affects the function graph tracer and
filter_irq_stacks().

As the IDT entry points are all sequentialy emitted this is rather simple
to unbreak by injecting __irqentry_text_start/end as global labels.

To make this work correctly:

  - Remove the IRQENTRY_TEXT section from the x86 linker script
  - Define __irqentry so it breaks the build if it's used
  - Adjust the entry mirroring in PTI
  - Remove the redundant kprobes and unwinder bound checks

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11 15:15:29 +02:00
Peter Zijlstra
2823e83a3d x86/entry: __always_inline CR2 for noinstr
vmlinux.o: warning: objtool: exc_page_fault()+0x9: call to read_cr2() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x24: call to prefetchw() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_page_fault()+0x21: call to kvm_handle_async_pf.isra.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_nmi()+0x1cc: call to write_cr2() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200603114052.243227806@infradead.org
2020-06-11 15:15:28 +02:00
Peter Zijlstra
4b281e541b x86/entry: __always_inline arch_atomic_* for noinstr
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x33: call to arch_atomic_and.constprop.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200603114052.070166551@infradead.org
2020-06-11 15:15:27 +02:00
Peter Zijlstra
7a745be1cc x86/entry: __always_inline irqflags for noinstr
vmlinux.o: warning: objtool: lockdep_hardirqs_on()+0x65: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off()+0x5d: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x35: call to arch_local_irq_save() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x31: call to arch_local_save_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: check_preemption_disabled()+0x33: call to arch_irqs_disabled_flags() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x2f: call to native_irq_disable() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200603114052.012171668@infradead.org
2020-06-11 15:15:27 +02:00
Peter Zijlstra
28eaf87121 x86/entry: __always_inline debugreg for noinstr
vmlinux.o: warning: objtool: exc_debug()+0x21: call to native_get_debugreg() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200603114051.954401211@infradead.org
2020-06-11 15:15:26 +02:00
Thomas Gleixner
3e77abda65 x86/idt: Consolidate idt functionality
- Move load_current_idt() out of line and replace the hideous comment with
   a lockdep assert. This allows to make idt_table and idt_descr static.

 - Mark idt_table read only after the IDT initialization is complete.

 - Shuffle code around to consolidate the #ifdef sections into one.

 - Adapt the F00F bug code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200528145523.084915381@linutronix.de
2020-06-11 15:15:26 +02:00
Peter Zijlstra
59bc300b71 x86/entry: Clarify irq_{enter,exit}_rcu()
Because:

  irq_enter_rcu() includes lockdep_hardirq_enter()
  irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()

Which resulted in two 'stray' lockdep_hardirq_exit() calls in
idtentry.h, and me spending a long time trying to find the matching
enter calls.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.359433429@infradead.org
2020-06-11 15:15:24 +02:00
Peter Zijlstra
fd501d4f03 x86/entry: Remove DBn stacks
Both #DB itself, as all other IST users (NMI, #MC) now clear DR7 on
entry. Combined with not allowing breakpoints on entry/noinstr/NOKPROBE
text and no single step (EFLAGS.TF) inside the #DB handler should guarantee
no nested #DB.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.303027161@infradead.org
2020-06-11 15:15:23 +02:00
Peter Zijlstra
f9912ada82 x86/entry: Remove debug IDT frobbing
This is all unused now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.245019500@infradead.org
2020-06-11 15:15:23 +02:00
Peter Zijlstra
84b6a34915 x86/entry: Optimize local_db_save() for virt
Because DRn access is 'difficult' with virt; but the DR7 read is cheaper
than a cacheline miss on native, add a virt specific fast path to
local_db_save(), such that when breakpoints are not in use to avoid
touching DRn entirely.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org
2020-06-11 15:15:22 +02:00
Peter Zijlstra
e1de11d4d1 x86/entry: Introduce local_db_{save,restore}()
In order to allow other exceptions than #DB to disable breakpoints,
provide common helpers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.012060983@infradead.org
2020-06-11 15:15:21 +02:00
Thomas Gleixner
320100a5ff x86/entry: Remove the TRACE_IRQS cruft
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202120.523289762@linutronix.de
2020-06-11 15:15:19 +02:00