[ Note, this is a Git cherry-pick of the following commit:
2b67799bdf25 ("x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD")
... for easier x86 PTI code testing and back-porting. ]
The latest AMD AMD64 Architecture Programmer's Manual
adds a CPUID feature XSaveErPtr (CPUID_Fn80000008_EBX[2]).
If this feature is set, the FXSAVE, XSAVE, FXSAVEOPT, XSAVEC, XSAVES
/ FXRSTOR, XRSTOR, XRSTORS always save/restore error pointers,
thus making the X86_BUG_FXSAVE_LEAK workaround obsolete on such CPUs.
Signed-Off-By: Rudolf Marek <r.marek@assembler.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/bdcebe90-62c5-1f05-083c-eba7f08b2540@assembler.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- fix the s2ram regression related to confusion around segment
register restoration, plus related cleanups that make the code more
robust
- a guess-unwinder Kconfig dependency fix
- an isoimage build target fix for certain tool chain combinations
- instruction decoder opcode map fixes+updates, and the syncing of
the kernel decoder headers to the objtool headers
- a kmmio tracing fix
- two 5-level paging related fixes
- a topology enumeration fix on certain SMP systems"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
x86/decoder: Fix and update the opcodes map
x86/power: Make restore_processor_context() sane
x86/power/32: Move SYSENTER MSR restoration to fix_processor_context()
x86/power/64: Use struct desc_ptr for the IDT in struct saved_context
x86/unwinder/guess: Prevent using CONFIG_UNWINDER_GUESS=y with CONFIG_STACKDEPOT=y
x86/build: Don't verify mtools configuration file for isoimage
x86/mm/kmmio: Fix mmiotrace for page unaligned addresses
x86/boot/compressed/64: Print error if 5-level paging is not supported
x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
Uprobe is a tracing mechanism for userspace programs.
Typical uprobe will incur overhead of two traps.
First trap is caused by replaced trap insn, and
the second trap is to execute the original displaced
insn in user space.
To reduce the overhead, kernel provides hooks
for architectures to emulate the original insn
and skip the second trap. In x86, emulation
is done for certain branch insns.
This patch extends the emulation to "push <reg>"
insns. These insns are typical in the beginning
of the function. For example, bcc
in https://github.com/iovisor/bcc repo provides
tools to measure funclantency, detect memleak, etc.
The tools will place uprobes in the beginning of
function and possibly uretprobes at the end of function.
This patch is able to reduce the trap overhead for
uprobe from 2 to 1.
Without this patch, uretprobe will typically incur
three traps. With this patch, if the function starts
with "push" insn, the number of traps can be
reduced from 3 to 2.
An experiment was conducted on two local VMs,
fedora 26 64-bit VM and 32-bit VM, both 4 processors
and 4GB memory, booted with latest tip repo (and this patch).
The host is MacBook with intel i7 processor.
The test program looks like:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
static void test() __attribute__((noinline));
void test() {}
int main() {
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 1000000; i++) {
test();
}
gettimeofday(&end, NULL);
printf("%ld\n", ((end.tv_sec * 1000000 + end.tv_usec)
- (start.tv_sec * 1000000 + start.tv_usec)));
return 0;
}
The program is compiled without optimization, and
the first insn for function "test" is "push %rbp".
The host is relatively idle.
Before the test run, the uprobe is inserted as below for uprobe:
echo 'p <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
and for uretprobe:
echo 'r <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
Unit: microsecond(usec) per loop iteration
x86_64 W/ this patch W/O this patch
uprobe 1.55 3.1
uretprobe 2.0 3.6
x86_32 W/ this patch W/O this patch
uprobe 1.41 3.5
uretprobe 1.75 4.0
You can see that this patch significantly reduced the overhead,
50% for uprobe and 44% for uretprobe on x86_64, and even more
on x86_32.
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20171201001202.3706564-1-yhs@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Documentation/x86/topology.txt defines smp_num_siblings as "The number of
threads in a core". Since commit bbb65d2d36 ("x86: use cpuid vector 0xb
when available for detecting cpu topology") smp_num_siblings is the
maximum number of threads in a core. If Simultaneous MultiThreading
(SMT) is disabled on a system, smp_num_siblings is 2 and not 1 as
expected.
Use topology_max_smt_threads(), which contains the active numer of threads,
in the __max_logical_packages calculation.
On a single socket, single core, single thread system __max_smt_threads has
not been updated when the __max_logical_packages calculation happens, so its
zero which makes the package estimate fail. Initialize it to one, which is
the minimum number of threads on a core.
[ tglx: Folded the __max_smt_threads fix in ]
Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Prarit Bhargava <prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kubakici@wp.pl>
Cc: netdev@vger.kernel.org
Cc: "netdev@vger.kernel.org"
Cc: Clark Williams <williams@redhat.com>
Link: https://lkml.kernel.org/r/20171204164521.17870-1-prarit@redhat.com
Pull misc x86 fixes from Ingo Molnar:
- make CR4 handling irq-safe, which bug vmware guests ran into
- don't crash on early IRQs in Xen guests
- don't crash secondary CPU bringup if #UD assisted WARN()ings are
triggered
- make X86_BUG_FXSAVE_LEAK optional on newer AMD CPUs that have the fix
- fix AMD Fam17h microcode loading
- fix broadcom_postcore_init() if ACPI is disabled
- fix resume regression in __restore_processor_context()
- fix Sparse warnings
- fix a GCC-8 warning
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Change time() prototype to match __vdso_time()
x86: Fix Sparse warnings about non-static functions
x86/power: Fix some ordering bugs in __restore_processor_context()
x86/PCI: Make broadcom_postcore_init() check acpi_disabled
x86/microcode/AMD: Add support for fam17h microcode loading
x86/cpufeatures: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
x86/idt: Load idt early in start_secondary
x86/xen: Support early interrupts in xen pv guests
x86/tlb: Disable interrupts when changing CR4
x86/tlb: Refactor CR4 setting and shadow write
The McaIntrCfg register (MSRC000_0410), previously known as CU_DEFER_ERR,
is used on SMCA systems to set the LVT offset for the Threshold and
Deferred error interrupts.
This register was used on non-SMCA systems to also set the Deferred
interrupt type in bits 2:1. However, these bits are reserved on SMCA
systems.
Only set MSRC000_0410[2:1] on non-SMCA systems.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20171120162646.5210-1-Yazen.Ghannam@amd.com
According to the Intel SDM Volume 3B (253669-063US, July 2017), action
optional (SRAO) errors can be reported either via MCE or CMC:
In cases when SRAO is signaled via CMCI the error signature is
indicated via UC=1, PCC=0, S=0.
Type(*1) UC EN PCC S AR Signaling
---------------------------------------------------------------
UC 1 1 1 x x MCE
SRAR 1 1 0 1 1 MCE
SRAO 1 x(*2) 0 x(*2) 0 MCE/CMC
UCNA 1 x 0 0 0 CMC
CE 0 x x x x CMC
NOTES:
1. SRAR, SRAO and UCNA errors are supported by the processor only
when IA32_MCG_CAP[24] (MCG_SER_P) is set.
2. EN=1, S=1 when signaled via MCE. EN=x, S=0 when signaled via CMC.
And there is a description in 15.6.2 UCR Error Reporting and Logging, for
bit S:
S (Signaling) flag, bit 56 - Indicates (when set) that a machine check
exception was generated for the UCR error reported in this MC bank...
When the S flag in the IA32_MCi_STATUS register is clear, this UCR error
was not signaled via a machine check exception and instead was reported
as a corrected machine check (CMC).
So merge the two cases and just remove the S=0 check for SRAO in
mce_severity().
[ Borislav: Massage commit message.]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Chen Wei <chenwei68@huawei.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1511575548-41992-1-git-send-email-xiexiuqi@huawei.com
The ACPI code supporting system transitions to sleep states uses
an internal blacklist to apply special handling to some machines
reported to behave incorrectly in some ways.
However, some entries of that blacklist cover problematic as well as
non-problematic systems, so give the users of the latter a chance to
ignore the blacklist and run their systems in the default way by
adding acpi_sleep=nobl to the kernel command line.
For example, that allows the users of Dell XPS13 9360 systems not
affected by the issue that caused the blacklist entry for this
machine to be added by commit 71630b7a83 (ACPI / PM: Blacklist Low
Power S0 Idle _DSM for Dell XPS13 9360) to use suspend-to-idle with
the Low Power S0 Idle _DSM interface which in principle should be
more energy-efficient than S3 on them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull misc x86 fixes from Ingo Molnar:
- topology enumeration fixes
- KASAN fix
- two entry fixes (not yet the big series related to KASLR)
- remove obsolete code
- instruction decoder fix
- better /dev/mem sanity checks, hopefully working better this time
- pkeys fixes
- two ACPI fixes
- 5-level paging related fixes
- UMIP fixes that should make application visible faults more debuggable
- boot fix for weird virtualization environment
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/decoder: Add new TEST instruction pattern
x86/PCI: Remove unused HyperTransport interrupt support
x86/umip: Fix insn_get_code_seg_params()'s return value
x86/boot/KASLR: Remove unused variable
x86/entry/64: Add missing irqflags tracing to native_load_gs_index()
x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow
x86/entry/64: Fix entry_SYSCALL_64_after_hwframe() IRQ tracing
x86/pkeys/selftests: Fix protection keys write() warning
x86/pkeys/selftests: Rename 'si_pkey' to 'siginfo_pkey'
x86/mpx/selftests: Fix up weird arrays
x86/pkeys: Update documentation about availability
x86/umip: Print a warning into the syslog if UMIP-protected instructions are used
x86/smpboot: Fix __max_logical_packages estimate
x86/topology: Avoid wasting 128k for package id array
perf/x86/intel/uncore: Cache logical pkg id in uncore driver
x86/acpi: Reduce code duplication in mp_override_legacy_irq()
x86/acpi: Handle SCI interrupts above legacy space gracefully
x86/boot: Fix boot failure when SMP MP-table is based at 0
x86/mm: Limit mmap() of /dev/mem to valid physical addresses
x86/selftests: Add test for mapping placement for 5-level paging
...
CR4 modifications are implemented as RMW operations which update a shadow
variable and write the result to CR4. The RMW operation is protected by
preemption disable, but there is no enforcement or debugging mechanism.
CR4 modifications happen also in interrupt context via
__native_flush_tlb_global(). This implementation does not affect a
interrupted thread context CR4 operation, because the CR4 toggle restores
the original content and does not modify the shadow variable.
So the current situation seems to be safe, but a recent patch tried to add
an actual RMW operation in interrupt context, which will cause subtle
corruptions.
To prevent that and make the CR4 handling future proof:
- Add a lockdep assertion to __cr4_set() which will catch interrupt
enabled invocations
- Disable interrupts in the cr4 manipulator inlines
- Rename cr4_toggle_bits() to cr4_toggle_bits_irqsoff(). This is called
from __switch_to_xtra() where interrupts are already disabled and
performance matters.
All other call sites are not performance critical, so the extra overhead of
an additional local_irq_save/restore() pair is not a problem. If new call
sites care about performance then the necessary _irqsoff() variants can be
added.
[ tglx: Condensed the patch by moving the irq protection inside the
manipulator functions. Updated changelog ]
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Luck <tony.luck@intel.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: nadav.amit@gmail.com
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20171125032907.2241-3-namit@vmware.com
In order to save on redundant structs definitions
insn_get_code_seg_params() was made to return two 4-bit values in a char
but clang complains:
arch/x86/lib/insn-eval.c:780:10: warning: implicit conversion from 'int' to 'char'
changes value from 132 to -124 [-Wconstant-conversion]
return INSN_CODE_SEG_PARAMS(4, 8);
~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/x86/include/asm/insn-eval.h:16:57: note: expanded from macro 'INSN_CODE_SEG_PARAMS'
#define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))
Those two values do get picked apart afterwards the opposite way of how
they were ORed so wrt to the LSByte, the return value is the same.
But this function returns -EINVAL in the error case, which is an int. So
make it return an int which is the native word size anyway and thus fix
the clang warning.
Reported-by: Kees Cook <keescook@google.com>
Reported-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ricardo.neri-calderon@linux.intel.com
Link: https://lkml.kernel.org/r/20171123091951.1462-1-bp@alien8.de
Print a rate-limited warning when a user-space program attempts to execute
any of the instructions that UMIP protects (i.e., SGDT, SIDT, SLDT, STR
and SMSW).
This is useful, because when CONFIG_X86_INTEL_UMIP=y is selected and
supported by the hardware, user space programs that try to execute such
instructions will receive a SIGSEGV signal that they might not expect.
In the specific cases for which emulation is provided (instructions SGDT,
SIDT and SMSW in protected and virtual-8086 modes), no signal is
generated. However, a warning is helpful to encourage updates in such
programs to avoid the use of such instructions.
Warnings are printed via a customized printk() function that also provides
information about the program that attempted to use the affected
instructions.
Utility macros are defined to wrap umip_printk() for the error and warning
kernel log levels.
While here, replace an existing call to the generic rate-limited pr_err()
with the new umip_pr_err().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1511233476-17088-1-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull two power management fixes from Rafael Wysocki:
"This is the change making /proc/cpuinfo on x86 report current CPU
frequency in "cpu MHz" again in all cases and an additional one
dealing with an overzealous check in one of the helper routines in the
runtime PM framework"
* tag 'pm-fixes-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / runtime: Drop children check from __pm_runtime_set_status()
x86 / CPU: Always show current CPU frequency in /proc/cpuinfo
Analyzing large early boot allocations unveiled the logical package id
storage as a prominent memory waste. Since commit 1f12e32f4c
("x86/topology: Create logical package id") every 64-bit system allocates a
128k array to convert logical package ids.
This happens because the array is sized for MAX_LOCAL_APIC which is always
32k on 64bit systems, and it needs 4 bytes for each entry.
This is fairly wasteful, especially for the common case of having only one
socket, which uses exactly 4 byte out of 128K.
There is no user of the package id map which is performance critical, so
the lookup is not required to be O(1). Store the logical processor id in
cpu_data and use a loop based lookup.
To keep the mapping stable accross cpu hotplug operations, add a flag to
cpu_data which is set when the CPU is brought up the first time. When the
flag is set, then cpu_data is not reinitialized by copying boot_cpu_data on
subsequent bringups.
[ tglx: Rename the flag to 'initialized', use proper pointers instead of
repeated cpu_data(x) evaluation and massage changelog. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: He Chen <he.chen@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Piotr Luc <piotr.luc@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171114124257.22013-3-prarit@redhat.com