As part of the effort to separate out architecture specific code, move the
check for detecting if the hypercall page is setup.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code, move the
crash notification function.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code,
extract hypervisor version information in an architecture specific
file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code,
consolidate all Hyper-V specific clocksource code to an architecture
specific code.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code, move the
hypercall invocation code to an architecture specific file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code, move the
hypercall page setup to an architecture specific file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code, move the
definition of generate_guest_id() to x86 specific header file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to separate out architecture specific code, move the
definition of hv_x64_msr_hypercall_contents to x86 specific header file.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vector population count instructions for dwords and qwords are going to be
available in future Intel Xeon & Xeon Phi processors. Bit 14 of
CPUID[level:0x07, ECX] indicates that the instructions are supported by a
processor.
The specification can be found in the Intel Software Developer Manual (SDM)
and in the Instruction Set Extensions Programming Reference (ISE).
Populate the feature bit and clear it when xsave is disabled.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Link: http://lkml.kernel.org/r/20170110173403.6010-2-piotr.luc@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are a handful of callers to save_stack_trace_tsk() and
show_stack() which try to unwind the stack of a task other than current.
In such cases, it's remotely possible that the task is running on one
CPU while the unwinder is reading its stack from another CPU, causing
the unwinder to see stack corruption.
These cases seem to be mostly harmless. The unwinder has checks which
prevent it from following bad pointers beyond the bounds of the stack.
So it's not really a bug as long as the caller understands that
unwinding another task will not always succeed.
In such cases, it's possible that the unwinder may read a KASAN-poisoned
region of the stack. Account for that by using READ_ONCE_NOCHECK() when
reading the stack of another task.
Use READ_ONCE() when reading the stack of the current task, since KASAN
warnings can still be useful for finding bugs in that case.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since on Intel we're required to do CPUID(1) first, before reading
the microcode revision MSR, let's add a special helper which does the
required steps so that we don't forget to do them next time, when we
want to read the microcode revision.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20170109114147.5082-4-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections. For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.
The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, so using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.
In order to use common bits to distinguish both type of special PTEs, we
now use only bit 62 as the special bit.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
hash max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits 49847
hash_64 105
perfect 97
While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).
Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").
Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.
Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
In commit 6290602709 ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 fixes from Ingo Molnar:
"There's a number of fixes:
- a round of fixes for CPUID-less legacy CPUs
- a number of microcode loader fixes
- i8042 detection robustization fixes
- stack dump/unwinder fixes
- x86 SoC platform driver fixes
- a GCC 7 warning fix
- virtualization related fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
Revert "x86/unwind: Detect bad stack return address"
x86/paravirt: Mark unused patch_default label
x86/microcode/AMD: Reload proper initrd start address
x86/platform/intel/quark: Add printf attribute to imr_self_test_result()
x86/platform/intel-mid: Switch MPU3050 driver to IIO
x86/alternatives: Do not use sync_core() to serialize I$
x86/topology: Document cpu_llc_id
x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
x86/asm: Rewrite sync_core() to use IRET-to-self
x86/microcode/intel: Replace sync_core() with native_cpuid()
Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
x86/cpu: Probe CPUID leaf 6 even when cpuid_level == 6
x86/tools: Fix gcc-7 warning in relocs.c
x86/unwind: Dump stack data on warnings
x86/unwind: Adjust last frame check for aligned function stacks
x86/init: Fix a couple of comment typos
x86/init: Remove i8042_detect() from platform ops
Input: i8042 - Trust firmware a bit more when probing on X86
x86/init: Add i8042 state to the platform data
...
Pull x86 cache allocation interface from Thomas Gleixner:
"This provides support for Intel's Cache Allocation Technology, a cache
partitioning mechanism.
The interface is odd, but the hardware interface of that CAT stuff is
odd as well.
We tried hard to come up with an abstraction, but that only allows
rather simple partitioning, but no way of sharing and dealing with the
per package nature of this mechanism.
In the end we decided to expose the allocation bitmaps directly so all
combinations of the hardware can be utilized.
There are two ways of associating a cache partition:
- Task
A task can be added to a resource group. It uses the cache
partition associated to the group.
- CPU
All tasks which are not member of a resource group use the group to
which the CPU they are running on is associated with.
That allows for simple CPU based partitioning schemes.
The main expected user sare:
- Virtualization so a VM can only trash only the associated part of
the cash w/o disturbing others
- Real-Time systems to seperate RT and general workloads.
- Latency sensitive enterprise workloads
- In theory this also can be used to protect against cache side
channel attacks"
[ Intel RDT is "Resource Director Technology". The interface really is
rather odd and very specific, which delayed this pull request while I
was thinking about it. The pull request itself came in early during
the merge window, I just delayed it until things had calmed down and I
had more time.
But people tell me they'll use this, and the good news is that it is
_so_ specific that it's rather independent of anything else, and no
user is going to depend on the interface since it's pretty rare. So if
push comes to shove, we can just remove the interface and nothing will
break ]
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/intel_rdt: Implement show_options() for resctrlfs
x86/intel_rdt: Call intel_rdt_sched_in() with preemption disabled
x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount
x86/intel_rdt: Fix setting of closid when adding CPUs to a group
x86/intel_rdt: Update percpu closid immeditately on CPUs affected by changee
x86/intel_rdt: Reset per cpu closids on unmount
x86/intel_rdt: Select KERNFS when enabling INTEL_RDT_A
x86/intel_rdt: Prevent deadlock against hotplug lock
x86/intel_rdt: Protect info directory from removal
x86/intel_rdt: Add info files to Documentation
x86/intel_rdt: Export the minimum number of set mask bits in sysfs
x86/intel_rdt: Propagate error in rdt_mount() properly
x86/intel_rdt: Add a missing #include
MAINTAINERS: Add maintainer for Intel RDT resource allocation
x86/intel_rdt: Add scheduler hook
x86/intel_rdt: Add schemata file
x86/intel_rdt: Add tasks files
x86/intel_rdt: Add cpus file
x86/intel_rdt: Add mkdir to resctrl file system
x86/intel_rdt: Add "info" files to resctrl file system
...
Pull KVM fixes from Paolo Bonzini:
"Early fixes for x86.
Instead of the (botched) revert, the lockdep/might_sleep splat has a
real fix provided by Andrea"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
kvm: take srcu lock around kvm_steal_time_set_preempted()
kvm: fix schedule in atomic in kvm_steal_time_set_preempted()
KVM: hyperv: fix locking of struct kvm_hv fields
KVM: x86: Expose Intel AVX512IFMA/AVX512VBMI/SHA features to guest.
kvm: nVMX: Correct a VMX instruction error code for VMPTRLD
Add i8042 state to the platform data to help i8042 driver make decision
whether to probe for i8042 or not. We recognize 3 states: platform/subarch
ca not possible have i8042 (as is the case with Inrel MID platform),
firmware (such as ACPI) reports that i8042 is absent from the device,
or i8042 may be present and the driver should probe for it.
The intent is to allow i8042 driver abort initialization on x86 if PNP data
(absence of both keyboard and mouse PNP devices) agrees with firmware data.
It will also allow us to remove i8042_detect later.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Cc: linux-input@vger.kernel.org
Link: http://lkml.kernel.org/r/1481317061-31486-2-git-send-email-dmitry.torokhov@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer updates from Thomas Gleixner:
"This is the last functional update from the tip tree for 4.10. It got
delayed due to a newly reported and anlyzed variant of BIOS bug and
the resulting wreckage:
- Seperation of TSC being marked realiable and the fact that the
platform provides the TSC frequency via CPUID/MSRs and making use
for it for GOLDMONT.
- TSC adjust MSR validation and sanitizing:
The TSC adjust MSR contains the offset to the hardware counter. The
sum of the adjust MSR and the counter is the TSC value which is
read via RDTSC.
On at least two machines from different vendors the BIOS sets the
TSC adjust MSR to negative values. This happens on cold and warm
boot. While on cold boot the offset is a few milliseconds, on warm
boot it basically compensates the power on time of the system. The
BIOSes are not even using the adjust MSR to set all CPUs in the
package to the same offset. The offsets are different which renders
the TSC unusable,
What's worse is that the TSC deadline timer has a HW feature^Wbug.
It malfunctions when the TSC adjust value is negative or greater
equal 0x80000000 resulting in silent boot failures, hard lockups or
non firing timers. This looks like some hardware internal 32/64bit
issue with a sign extension problem. Intel has been silent so far
on the issue.
The update contains sanity checks and keeps the adjust register
within working limits and in sync on the package.
As it looks like this disease is spreading via BIOS crapware, we
need to address this urgently as the boot failures are hard to
debug for users"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Limit the adjust value further
x86/tsc: Annotate printouts as firmware bug
x86/tsc: Force TSC_ADJUST register to value >= zero
x86/tsc: Validate TSC_ADJUST after resume
x86/tsc: Validate cpumask pointer before accessing it
x86/tsc: Fix broken CONFIG_X86_TSC=n build
x86/tsc: Try to adjust TSC if sync test fails
x86/tsc: Prepare warp test for TSC adjustment
x86/tsc: Move sync cleanup to a safe place
x86/tsc: Sync test only for the first cpu in a package
x86/tsc: Verify TSC_ADJUST from idle
x86/tsc: Store and check TSC ADJUST MSR
x86/tsc: Detect random warps
x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
x86/tsc: Finalize the split of the TSC_RELIABLE flag
x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
x86/tsc: Mark TSC frequency determined by CPUID as known
x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag
Pull x86 fixes and cleanups from Thomas Gleixner:
"This set of updates contains:
- Robustification for the logical package managment. Cures the AMD
and virtualization issues.
- Put the correct start_cpu() return address on the stack of the idle
task.
- Fixups for the fallout of the nodeid <-> cpuid persistent mapping
modifciations
- Move the x86/MPX specific mm_struct member to the arch specific
mm_context where it belongs
- Cleanups for C89 struct initializers and useless function
arguments"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/floppy: Use designated initializers
x86/mpx: Move bd_addr to mm_context_t
x86/mm: Drop unused argument 'removed' from sync_global_pgds()
ACPI/NUMA: Do not map pxm to node when NUMA is turned off
x86/acpi: Use proper macro for invalid node
x86/smpboot: Prevent false positive out of bounds cpumask access warning
x86/boot/64: Push correct start_cpu() return address
x86/boot/64: Use 'push' instead of 'call' in start_cpu()
x86/smpboot: Make logical package management more robust
Pull kbuild updates from Michal Marek:
- prototypes for x86 asm-exported symbols (Adam Borowski) and a warning
about missing CRCs (Nick Piggin)
- asm-exports fix for LTO (Nicolas Pitre)
- thin archives improvements (Nick Piggin)
- linker script fix for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (Nick
Piggin)
- genksyms support for __builtin_va_list keyword
- misc minor fixes
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
x86/kbuild: enable modversions for symbols exported from asm
kbuild: fix scripts/adjust_autoksyms.sh* for the no modules case
scripts/kallsyms: remove last remnants of --page-offset option
make use of make variable CURDIR instead of calling pwd
kbuild: cmd_export_list: tighten the sed script
kbuild: minor improvement for thin archives build
kbuild: modpost warn if export version crc is missing
kbuild: keep data tables through dead code elimination
kbuild: improve linker compatibility with lib-ksyms.o build
genksyms: Regenerate parser
kbuild/genksyms: handle va_list type
kbuild: thin archives for multi-y targets
kbuild: kallsyms allow 3-pass generation if symbols size has changed
Currently bd_addr lives in mm_struct, which is otherwise architecture
independent. Architecture-specific data is supposed to live within
mm_context_t (itself contained in mm_struct).
Other x86-specific context like the pkey accounting data lives in
mm_context_t, and there's no readon the MPX data can't also live there.
So as to keep the arch-specific data togather, and to set a good example
for others, this patch moves bd_addr into x86's mm_context_t.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1481892055-24596-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex. Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull tracing updates from Steven Rostedt:
"This release has a few updates:
- STM can hook into the function tracer
- Function filtering now supports more advance glob matching
- Ftrace selftests updates and added tests
- Softirq tag in traces now show only softirqs
- ARM nop added to non traced locations at compile time
- New trace_marker_raw file that allows for binary input
- Optimizations to the ring buffer
- Removal of kmap in trace_marker
- Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
- Other various fixes and clean ups"
* tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
selftests: ftrace: Shift down default message verbosity
kprobes/trace: Fix kprobe selftest for newer gcc
tracing/kprobes: Add a helper method to return number of probe hits
tracing/rb: Init the CPU mask on allocation
tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
fgraph: Handle a case where a tracer ignores set_graph_notrace
tracing: Replace kmap with copy_from_user() in trace_marker writing
ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
tracing: Allow benchmark to be enabled at early_initcall()
tracing: Have system enable return error if one of the events fail
tracing: Do not start benchmark on boot up
tracing: Have the reg function allow to fail
ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
ring-buffer: Froce rb_update_write_stamp() to be inlined
ring-buffer: Force inline of hotpath helper functions
tracing: Make __buffer_unlock_commit() always_inline
tracing: Make tracepoint_printk a static_key
ring-buffer: Always inline rb_event_data()
ring-buffer: Make rb_reserve_next_event() always inlined
...
Roland reported that his DELL T5810 sports a value add BIOS which
completely wreckages the TSC. The squirmware [(TM) Ingo Molnar] boots with
random negative TSC_ADJUST values, different on all CPUs. That renders the
TSC useless because the sycnchronization check fails.
Roland tested the new TSC_ADJUST mechanism. While it manages to readjust
the TSCs he needs to disable the TSC deadline timer, otherwise the machine
just stops booting.
Deeper investigation unearthed that the TSC deadline timer is sensitive to
the TSC_ADJUST value. Writing TSC_ADJUST to a negative value results in an
interrupt storm caused by the TSC deadline timer.
This does not make any sense and it's hard to imagine what kind of hardware
wreckage is behind that misfeature, but it's reliably reproducible on other
systems which have TSC_ADJUST and TSC deadline timer.
While it would be understandable that a big enough negative value which
moves the resulting TSC readout into the negative space could have the
described effect, this happens even with a adjust value of -1, which keeps
the TSC readout definitely in the positive space. The compare register for
the TSC deadline timer is set to a positive value larger than the TSC, but
despite not having reached the deadline the interrupt is raised
immediately. If this happens on the boot CPU, then the machine dies
silently because this setup happens before the NMI watchdog is armed.
Further experiments showed that any other adjustment of TSC_ADJUST works as
expected as long as it stays in the positive range. The direction of the
adjustment has no influence either. See the lkml link for further analysis.
Yet another proof for the theory that timers are designed by janitors and
the underlying (obviously undocumented) mechanisms which allow BIOSes to
wreckage them are considered a feature. Well done Intel - NOT!
To address this wreckage add the following sanity measures:
- If the TSC_ADJUST value on the boot cpu is not 0, set it to 0
- If the TSC_ADJUST value on any cpu is negative, set it to 0
- Prevent the cross package synchronization mechanism from setting negative
TSC_ADJUST values.
Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
Cc: Kevin Stanton <kevin.b.stanton@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Allen Hung <allen_hung@dell.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161213131211.397588033@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>