On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
More details on the bug and its manifestation with Windows and OVMF:
It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
I believe that the SMM part explains why we started seeing this only with
OVMF.
KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
later on in x86_emulate_instruction we overwrite arch.hflags with
ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
The AMD-specific hflag of interest here is HF_NMI_MASK.
When rebooting the system, Windows sends an NMI IPI to all but the current
cpu to shut them down. Only after all of them are parked in HLT will the
initiating cpu finish the restart. If NMI is masked, other cpus never get
the memo and the initiating cpu spins forever, waiting for
hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_make_all_requests() provides a synchronization that waits until all
kicked VCPUs have acknowledged the kick. This is important for
KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
underway.
This patch adds the synchronization property into all requests that are
currently being used with kvm_make_all_requests() in order to preserve
the current behavior and only introduce a new framework. Removing it
from requests where it is not necessary is left for future patches.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some operations must ensure that the guest is not running with stale
data, but if the guest is halted, then the update can wait until another
event happens. kvm_make_all_requests() currently doesn't wake up, so we
can mark all requests used with it.
First 8 bits were arbitrarily reserved for request numbers.
Most uses of requests have the request type as a constant, so a compiler
will optimize the '&'.
An alternative would be to have an inline function that would return
whether the request needs a wake-up or not, but I like this one better
even though it might produce worse assembly.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.
This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This reverts commit 2947ba054a.
Dan Williams reported dax-pmem kernel warnings with the following signature:
WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
... and bisected it to this commit, which suggests possible memory corruption
caused by the x86 fast-GUP conversion.
He also pointed out:
"
This is similar to the backtrace when we were not properly handling
pud faults and was fixed with this commit: 220ced1676 "mm: fix
get_user_pages() vs device-dax pud mappings"
I've found some missing _devmap checks in the generic
get_user_pages_fast() path, but this does not fix the regression
[...]
"
So given that there are known bugs, and a pretty robust looking bisection
points to this commit suggesting that are unknown bugs in the conversion
as well, revert it for the time being - we'll re-try in v4.13.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dann.frazier@canonical.com
Cc: dave.hansen@intel.com
Cc: steve.capper@linaro.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hardware support for faulting on the cpuid instruction is not required to
emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
cpuid-induced VM exit checks the cpuid faulting state and the CPL.
kvm_require_cpl is even kind enough to inject the GP fault for us.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can declare it <linux/pci.h> even on platforms where it isn't going to
be defined. There's no need to have it littered through the various
<asm/pci.h> files.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Most of the almost-identical versions of pci_mmap_page_range() silently
ignore the 'write_combine' argument and give uncached mappings.
Yet we allow the PCIIOC_WRITE_COMBINE ioctl in /proc/bus/pci, expose the
'resourceX_wc' file in sysfs, and allow an attempted mapping to apparently
succeed.
To fix this, introduce a macro arch_can_pci_mmap_wc() which indicates
whether the platform can do a write-combining mapping. On x86 this ends up
being pat_enabled(), while the few other platforms that support it can just
set it to a literal '1'.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Merge branch 'i2c/for-INT33FE' of
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux.git
to prepare for an incoming INT33FE driver.
Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
Unlike normal compat syscall variants, it is needed only for
biarch architectures that have different alignement requirements for
u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
a store of 64bit value at 32bit-aligned address. We used to have one
such (ia64), but its biarch support has been gone since 2010 (after
being broken in 2008, which went unnoticed since nobody had been using
it).
It had escaped removal at the same time only because back in 2004
a patch that switched several syscalls on amd64 from private wrappers to
generic compat ones had switched to use of compat_sys_getdents64(), which
hadn't needed (or used) a compat wrapper on amd64.
Let's bury it - it's at least 7 years overdue.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
gcc-4.4.3 fails to statically initialize members of a anon union.
See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
The storage saving is not really worth it and aside of that it will catch
usage of the cache member for bandwidth and vice versa easier.
Fixes: 05b93417ce ("x86/intel_rdt/mba: Add primary support for Memory Bandwidth Allocation (MBA)")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull nvdimm fixes from Dan Williams:
"A small crop of lockdep, sleeping while atomic, and other fixes /
band-aids in advance of the full-blown reworks targeting the next
merge window. The largest change here is "libnvdimm: fix blk free
space accounting" which deletes a pile of buggy code that better
testing would have caught before merging. The next change that is
borderline too big for a late rc is switching the device-dax locking
from rcu to srcu, I couldn't think of a smaller way to make that fix.
The __copy_user_nocache fix will have a full replacement in 4.12 to
move those pmem special case considerations into the pmem driver. The
"libnvdimm: band aid btt vs clear poison locking" commit admits that
our error clearing support for btt went in broken, so we just disable
it in 4.11 and -stable. A replacement / full fix is in the pipeline
for 4.12
Some of these would have been caught earlier had DEBUG_ATOMIC_SLEEP
been enabled on my development station. I wonder if we should have:
config DEBUG_ATOMIC_SLEEP
default PROVE_LOCKING
...since I mistakenly thought I got both with PROVE_LOCKING=y.
These have received a build success notification from the 0day robot,
and some have appeared in a -next release with no reported issues"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation
libnvdimm: band aid btt vs clear poison locking
libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
libnvdimm: fix blk free space accounting
acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
The files in the info directory for MBA are as follows:
num_closids
The maximum number of CLOSids available for MBA
min_bandwidth
The minimum memory bandwidth percentage value
bandwidth_gran
The granularity of the bandwidth control in percent for the
particular CPU SKU. Intermediate values entered are rounded off
to the previous control step available. Available bandwidth
control steps are minimum_bandwidth + N * bandwidth_gran.
delay_linear
When set, the OS writes a linear percentage based value to the
control MSRs ranging from minimum_bandwidth to 100 percent.
This value is informational and has no influence on the values
written to the schemata files. The values written to the
schemata are always bandwidth percentage that is requested.
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: vikas.shivappa@intel.com
Link: http://lkml.kernel.org/r/1491611637-20417-7-git-send-email-vikas.shivappa@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Memory bandwidth allocation requires different information than cache
allocation.
To avoid a lump of data in struct rdt_resource, move all cache related
information into a seperate structure and add that to struct rdt_resource.
Sanitize the data types while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: vikas.shivappa@intel.com
A few people have reported unwinder warnings like the following:
WARNING: kernel stack frame pointer at ffffc90000fe7ff0 in rsync:1157 has bad value (null)
unwind stack type:0 next_sp: (null) mask:2 graph_idx:0
ffffc90000fe7f98: ffffc90000fe7ff0 (0xffffc90000fe7ff0)
ffffc90000fe7fa0: ffffffffb7000f56 (trace_hardirqs_off_thunk+0x1a/0x1c)
ffffc90000fe7fa8: 0000000000000246 (0x246)
ffffc90000fe7fb0: 0000000000000000 ...
ffffc90000fe7fc0: 00007ffe3af639bc (0x7ffe3af639bc)
ffffc90000fe7fc8: 0000000000000006 (0x6)
ffffc90000fe7fd0: 00007f80af433fc5 (0x7f80af433fc5)
ffffc90000fe7fd8: 00007ffe3af638e0 (0x7ffe3af638e0)
ffffc90000fe7fe0: 00007ffe3af638e0 (0x7ffe3af638e0)
ffffc90000fe7fe8: 00007ffe3af63970 (0x7ffe3af63970)
ffffc90000fe7ff0: 0000000000000000 ...
ffffc90000fe7ff8: ffffffffb7b74b9a (entry_SYSCALL_64_after_swapgs+0x17/0x4f)
This warning can happen when unwinding a code path where an interrupt
occurred in x86 entry code before it set up the first stack frame.
Silently ignore any warnings for this case.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: c32c47c68a ("x86/unwind: Warn on bad frame pointer")
Link: http://lkml.kernel.org/r/dbd6838826466a60dc23a52098185bc973ce2f1e.1492020577.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new call to SCU intel_scu_ipc_raw_command() writes SPTR and DPTR
registers before sending a command.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage
__copy_user_nocache() with explicit cache management in these unaligned
cases.
Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.
Fixes: 5de490daec ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <stable@vger.kernel.org>
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Let's add a new mode and set it while we create the irqchip via
KVM_CREATE_IRQCHIP and KVM_CAP_SPLIT_IRQCHIP.
This mode will be used later to test if adding routes
(in kvm_set_routing_entry()) is already allowed.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
There's a conflict between ongoing level-5 paging support and
the E820 rewrite. Since the E820 rewrite is essentially ready,
merge it into x86/mm to reduce tree conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The E820 rework in WIP.x86/boot has gone through a couple of weeks
of exposure in -tip, merge it in a wider fashion.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The resource control filesystem provides only a bitmask based cpus file for
assigning CPUs to a resource group. That's cumbersome with large cpumasks
and non-intuitive when modifying the file from the command line.
Range based cpu lists are commonly used along with bitmask based cpu files
in various subsystems throughout the kernel.
Add 'cpus_list' file which is CPU range based.
# cd /sys/fs/resctrl/
# echo 1-10 > krava/cpus_list
# cat krava/cpus_list
1-10
# cat krava/cpus
0007fe
# cat cpus
fffff9
# cat cpus_list
0,3-23
[ tglx: Massaged changelog and replaced "bitmask lists" by "CPU ranges" ]
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20170410145232.GF25354@krava
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The vsyscall32 sysctl can racy against a concurrent fork when it switches
from disabled to enabled:
arch_setup_additional_pages()
if (vdso32_enabled)
--> No mapping
sysctl.vsysscall32()
--> vdso32_enabled = true
create_elf_tables()
ARCH_DLINFO_IA32
if (vdso32_enabled) {
--> Add VDSO entry with NULL pointer
Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for
the newly forked process or not.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Its value has never changed; we might as well make it part of the ABI instead
of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).
Because PPC does not always make MMIO available, the code has to be made
dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
In order to simplify adding exit reasons in the future,
the array of exit reason names is now also sorted by
exit reason code.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Now use bit 6 of EPTP to optionally enable A/D bits for EPTP. Another
thing to change is that, when EPT accessed and dirty bits are not in use,
VMX treats accesses to guest paging structures as data reads. When they
are in use (bit 6 of EPTP is set), they are treated as writes and the
corresponding EPT dirty bit is set. The MMU didn't know this detail,
so this patch adds it.
We also have to fix up the exit qualification. It may be wrong because
KVM sets bit 6 but the guest might not.
L1 emulates EPT A/D bits using write permissions, so in principle it may
be possible for EPT A/D bits to be used by L1 even though not available
in hardware. The problem is that guest page-table walks will be treated
as reads rather than writes, so they would not cause an EPT violation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed typo in walk_addr_generic() comment and changed bit clear +
conditional-set pattern in handle_ept_violation() to conditional-clear]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>