Pull pidfd updates from Christian Brauner:
"This adds two main features.
- First, it adds polling support for pidfds. This allows process
managers to know when a (non-parent) process dies in a race-free
way.
The notification mechanism used follows the same logic that is
currently used when the parent of a task is notified of a child's
death. With this patchset it is possible to put pidfds in an
{e}poll loop and get reliable notifications for process (i.e.
thread-group) exit.
- The second feature compliments the first one by making it possible
to retrieve pollable pidfds for processes that were not created
using CLONE_PIDFD.
A lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these
processes a caller can currently not create a pollable pidfd. This
is a problem for Android's low memory killer (LMK) and service
managers such as systemd.
Both patchsets are accompanied by selftests.
It's perhaps worth noting that the work done so far and the work done
in this branch for pidfd_open() and polling support do already see
some adoption:
- Android is in the process of backporting this work to all their LTS
kernels [1]
- Service managers make use of pidfd_send_signal but will need to
wait until we enable waiting on pidfds for full adoption.
- And projects I maintain make use of both pidfd_send_signal and
CLONE_PIDFD [2] and will use polling support and pidfd_open() too"
[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22
[2] aab6e3eb73/src/lxc/start.c (L1753)
* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
tests: add pidfd_open() tests
arch: wire-up pidfd_open()
pid: add pidfd_open()
pidfd: add polling selftests
pidfd: add polling support
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.
The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.
With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.
As the key is intended to be immutable after init, move
native_write_cr0/4() out of line.
While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.
Fixes: 8dbec27a24 ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xi Ruoyao <xry111@mengyan1223.wang>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907102140340.1758@nanos.tec.linutronix.de
clang points out that the computation of LOWMEM_PAGES causes a signed
integer overflow on 32-bit x86:
arch/x86/kernel/head32.c:83:20: error: signed shift result (0x100000000) requires 34 bits to represent, but 'int' only has 32 bits [-Werror,-Wshift-overflow]
(PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT);
^~~~~~~~~~~~
arch/x86/include/asm/pgtable_32.h:109:27: note: expanded from macro 'LOWMEM_PAGES'
#define LOWMEM_PAGES ((((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT))
~^ ~~
arch/x86/include/asm/pgtable_32.h:98:34: note: expanded from macro 'PAGE_TABLE_SIZE'
#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
Use the _ULL() macro to make it a 64-bit constant.
Fixes: 1e620f9b23 ("x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190710130522.1802800-1-arnd@arndb.de
We get a warning when build kernel W=1:
arch/x86/kvm/../../../virt/kvm/eventfd.c:48:1: warning: no previous prototype for ‘kvm_arch_irqfd_allowed’ [-Wmissing-prototypes]
kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
^
The reason is kvm_arch_irqfd_allowed() is declared in arch/x86/kvm/irq.h,
which is not included by eventfd.c. Considering kvm_arch_irqfd_allowed()
is a weakly defined function in eventfd.c, remove the declaration to
kvm_host.h can fix this.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KASAN shows the following splat during boot:
BUG: KASAN: unknown-crash in unwind_next_frame+0x3f6/0x490
Read of size 8 at addr ffffffff84007db0 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Tainted: G T 5.2.0-rc6-00013-g7457c0d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x19/0x1b
print_address_description+0x1b0/0x2b2
__kasan_report+0x10f/0x171
kasan_report+0x12/0x1c
__asan_load8+0x54/0x81
unwind_next_frame+0x3f6/0x490
unwind_next_frame+0x1b/0x23
arch_stack_walk+0x68/0xa5
stack_trace_save+0x7b/0xa0
save_trace+0x3c/0x93
mark_lock+0x1ef/0x9b1
lock_acquire+0x122/0x221
__mutex_lock+0xb6/0x731
mutex_lock_nested+0x16/0x18
_vm_unmap_aliases+0x141/0x183
vm_unmap_aliases+0x14/0x16
change_page_attr_set_clr+0x15e/0x2f2
set_memory_4k+0x2a/0x2c
check_bugs+0x11fd/0x1298
start_kernel+0x793/0x7eb
x86_64_start_reservations+0x55/0x76
x86_64_start_kernel+0x87/0xaa
secondary_startup_64+0xa4/0xb0
Memory state around the buggy address:
ffffffff84007c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
ffffffff84007d00: f1 00 00 00 00 00 00 00 00 00 f2 f2 f2 f3 f3 f3
>ffffffff84007d80: f3 79 be 52 49 79 be 00 00 00 00 00 00 00 00 f1
It turns out that int3_selftest() is corrupting the stack. The problem is
that the KASAN-ified version of int3_magic() is much less trivial than the
C code appears. It clobbers several unexpected registers. So when the
selftest's INT3 is converted to an emulated call to int3_magic(), the
registers are clobbered and Bad Things happen when the function returns.
Fix this by converting int3_magic() to the trivial ASM function it should
be, avoiding all calling convention issues. Also add ASM_CALL_CONSTRAINT to
the INT3 ASM, since it contains a 'CALL'.
[peterz: cribbed changelog from josh]
Fixes: 7457c0da02 ("x86/alternatives: Add int3_emulate_call() selftest")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20190709125744.GB3402@hirez.programming.kicks-ass.net
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
Pull x865 kdump updates from Thomas Gleixner:
"Yet more kexec/kdump updates:
- Properly support kexec when AMD's memory encryption (SME) is
enabled
- Pass reserved e820 ranges to the kexec kernel so both PCI and SME
can work"
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/proc/vmcore: Enable dumping of encrypted memory when SEV was active
x86/kexec: Set the C-bit in the identity map page table when SEV is active
x86/kexec: Do not map kexec area as decrypted when SEV is active
x86/crash: Add e820 reserved ranges to kdump kernel's e820 table
x86/mm: Rework ioremap resource mapping determination
x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED
x86/mm: Create a workarea in the kernel for SME early encryption
x86/mm: Identify the end of the kernel area to be reserved
Pull x86 boot updates from Thomas Gleixner:
"Assorted updates to kexec/kdump:
- Proper kexec support for 4/5-level paging and jumping from a
5-level to a 4-level paging kernel.
- Make the EFI support for kexec/kdump more robust
- Enforce that the GDT is properly aligned instead of getting the
alignment by chance"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kdump/64: Restrict kdump kernel reservation to <64TB
x86/kexec/64: Prevent kexec from 5-level paging to a 4-level only kernel
x86/boot: Add xloadflags bits to check for 5-level paging support
x86/boot: Make the GDT 8-byte aligned
x86/kexec: Add the ACPI NVS region to the ident map
x86/boot: Call get_rsdp_addr() after console_init()
Revert "x86/boot: Disable RSDP parsing temporarily"
x86/boot: Use efi_setup_data for searching RSDP on kexec-ed kernels
x86/kexec: Add the EFI system tables and ACPI tables to the ident map
Pull perf updates from Ingo Molnar:
"The main changes in this cycle on the kernel side were:
- CPU PMU and uncore driver updates to Intel Snow Ridge, IceLake,
KabyLake, AmberLake and WhiskeyLake CPUs.
- Rework the MSR probing infrastructure to make it more robust, make
it work better on virtualized systems and to better expose it on
sysfs.
- Rework PMU attributes group support based on the feedback from
Greg. The core sysfs patch that adds sysfs_update_groups() was
acked by Greg.
There's a lot of perf tooling changes as well, all around the place:
- vendor updates to Intel, cs-etm (ARM), ARM64, s390,
- various enhancements to Intel PT tooling support:
- Improve CBR (Core to Bus Ratio) packets support.
- Export power and ptwrite events to sqlite and postgresql.
- Add support for decoding PEBS via PT packets.
- Add support for samples to contain IPC ratio, collecting cycles
information from CYC packets, showing the IPC info periodically
- Allow using time ranges
- lots of updates to perf pmu, perf stat, perf trace, eBPF support,
perf record, perf diff, etc. - please see the shortlog and Git log
for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (252 commits)
tools arch x86: Sync asm/cpufeatures.h with the with the kernel
tools build: Check if gettid() is available before providing helper
perf jvmti: Address gcc string overflow warning for strncpy()
perf python: Remove -fstack-protector-strong if clang doesn't have it
perf annotate TUI browser: Do not use member from variable within its own initialization
perf tests: Fix record+probe_libc_inet_pton.sh for powerpc64
perf evsel: Do not rely on errno values for precise_ip fallback
perf thread: Allow references to thread objects after machine__exit()
perf header: Assign proper ff->ph in perf_event__synthesize_features()
tools arch kvm: Sync kvm headers with the kernel sources
perf script: Allow specifying the files to process guest samples
perf tools metric: Don't include duration_time in group
perf list: Avoid extra : for --raw metrics
perf vendor events intel: Metric fixes for SKX/CLX
perf tools: Fix typos / broken sentences
perf jevents: Add support for Hisi hip08 L3C PMU aliasing
perf jevents: Add support for Hisi hip08 HHA PMU aliasing
perf jevents: Add support for Hisi hip08 DDRC PMU aliasing
perf pmu: Support more complex PMU event aliasing
perf diff: Documentation -c cycles option
...
Pull force_sig() argument change from Eric Biederman:
"A source of error over the years has been that force_sig has taken a
task parameter when it is only safe to use force_sig with the current
task.
The force_sig function is built for delivering synchronous signals
such as SIGSEGV where the userspace application caused a synchronous
fault (such as a page fault) and the kernel responded with a signal.
Because the name force_sig does not make this clear, and because the
force_sig takes a task parameter the function force_sig has been
abused for sending other kinds of signals over the years. Slowly those
have been fixed when the oopses have been tracked down.
This set of changes fixes the remaining abusers of force_sig and
carefully rips out the task parameter from force_sig and friends
making this kind of error almost impossible in the future"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
signal: Remove the signal number and task parameters from force_sig_info
signal: Factor force_sig_info_to_task out of force_sig_info
signal: Generate the siginfo in force_sig
signal: Move the computation of force into send_signal and correct it.
signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
signal: Remove the task parameter from force_sig_fault
signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
signal: Explicitly call force_sig_fault on current
signal/unicore32: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from ptrace_break
signal/nds32: Remove tsk parameter from send_sigtrap
signal/riscv: Remove tsk parameter from do_trap
signal/sh: Remove tsk parameter from force_sig_info_fault
signal/um: Remove task parameter from send_sigtrap
signal/x86: Remove task parameter from send_sigtrap
signal: Remove task parameter from force_sig_mceerr
signal: Remove task parameter from force_sig
signal: Remove task parameter from force_sigsegv
...
Pull crypto updates from Herbert Xu:
"Here is the crypto update for 5.3:
API:
- Test shash interface directly in testmgr
- cra_driver_name is now mandatory
Algorithms:
- Replace arc4 crypto_cipher with library helper
- Implement 5 way interleave for ECB, CBC and CTR on arm64
- Add xxhash
- Add continuous self-test on noise source to drbg
- Update jitter RNG
Drivers:
- Add support for SHA204A random number generator
- Add support for 7211 in iproc-rng200
- Fix fuzz test failures in inside-secure
- Fix fuzz test failures in talitos
- Fix fuzz test failures in qat"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (143 commits)
crypto: stm32/hash - remove interruptible condition for dma
crypto: stm32/hash - Fix hmac issue more than 256 bytes
crypto: stm32/crc32 - rename driver file
crypto: amcc - remove memset after dma_alloc_coherent
crypto: ccp - Switch to SPDX license identifiers
crypto: ccp - Validate the the error value used to index error messages
crypto: doc - Fix formatting of new crypto engine content
crypto: doc - Add parameter documentation
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
crypto: arm64/aes-ce - add 5 way interleave routines
crypto: talitos - drop icv_ool
crypto: talitos - fix hash on SEC1.
crypto: talitos - move struct talitos_edesc into talitos.h
lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
crypto/NX: Set receive window credits to max number of CRBs in RxFIFO
crypto: asymmetric_keys - select CRYPTO_HASH where needed
crypto: serpent - mark __serpent_setkey_sbox noinline
crypto: testmgr - dynamically allocate crypto_shash
crypto: testmgr - dynamically allocate testvec_config
crypto: talitos - eliminate unneeded 'done' functions at build time
...
Pull integrity updates from Mimi Zohar:
"Bug fixes, code clean up, and new features:
- IMA policy rules can be defined in terms of LSM labels, making the
IMA policy dependent on LSM policy label changes, in particular LSM
label deletions. The new environment, in which IMA-appraisal is
being used, frequently updates the LSM policy and permits LSM label
deletions.
- Prevent an mmap'ed shared file opened for write from also being
mmap'ed execute. In the long term, making this and other similar
changes at the VFS layer would be preferable.
- The IMA per policy rule template format support is needed for a
couple of new/proposed features (eg. kexec boot command line
measurement, appended signatures, and VFS provided file hashes).
- Other than the "boot-aggregate" record in the IMA measuremeent
list, all other measurements are of file data. Measuring and
storing the kexec boot command line in the IMA measurement list is
the first buffer based measurement included in the measurement
list"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
integrity: Introduce struct evm_xattr
ima: Update MAX_TEMPLATE_NAME_LEN to fit largest reasonable definition
KEXEC: Call ima_kexec_cmdline to measure the boot command line args
IMA: Define a new template field buf
IMA: Define a new hook to measure the kexec boot command line arguments
IMA: support for per policy rule template formats
integrity: Fix __integrity_init_keyring() section mismatch
ima: Use designated initializers for struct ima_event_data
ima: use the lsm policy update notifier
LSM: switch to blocking policy update notifiers
x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICY
ima: Make arch_policy_entry static
ima: prevent a file already mmap'ed write to be mmap'ed execute
x86/ima: check EFI SetupMode too
Pull x86 topology updates from Ingo Molnar:
"Implement multi-die topology support on Intel CPUs and expose the die
topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.
These changes should have no effect on the kernel's existing
understanding of topologies, i.e. there should be no behavioral impact
on cache, NUMA, scheduler, perf and other topologies and overall
system performance"
* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
perf/x86/intel/cstate: Support multi-die/package
perf/x86/intel/rapl: Support multi-die/package
perf/x86/intel/uncore: Support multi-die/package
topology: Create core_cpus and die_cpus sysfs attributes
topology: Create package_cpus sysfs attribute
hwmon/coretemp: Support multi-die/package
powercap/intel_rapl: Update RAPL domain name and debug messages
thermal/x86_pkg_temp_thermal: Support multi-die/package
powercap/intel_rapl: Support multi-die/package
powercap/intel_rapl: Simplify rapl_find_package()
x86/topology: Define topology_logical_die_id()
x86/topology: Define topology_die_id()
cpu/topology: Export die_id
x86/topology: Create topology_max_die_per_package()
x86/topology: Add CPUID.1F multi-die/package support
Pull x86 platform updayes from Ingo Molnar:
"Most of the commits add ACRN hypervisor guest support, plus two
cleanups"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/jailhouse: Mark jailhouse_x2apic_available() as __init
x86/platform/geode: Drop <linux/gpio.h> includes
x86/acrn: Use HYPERVISOR_CALLBACK_VECTOR for ACRN guest upcall vector
x86: Add support for Linux guests on an ACRN hypervisor
x86/Kconfig: Add new X86_HV_CALLBACK_VECTOR config symbol
Pull x86 paravirt updates from Ingo Molnar:
"A handful of paravirt patching code enhancements to make it more
robust against patching failures, and related cleanups and not so
related cleanups - by Thomas Gleixner and myself"
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Rename paravirt_patch_site::instrtype to paravirt_patch_site::type
x86/paravirt: Standardize 'insn_buff' variable names
x86/paravirt: Match paravirt patchlet field definition ordering to initialization ordering
x86/paravirt: Replace the paravirt patch asm magic
x86/paravirt: Unify the 32/64 bit paravirt patching code
x86/paravirt: Detect over-sized patching bugs in paravirt_patch_call()
x86/paravirt: Detect over-sized patching bugs in paravirt_patch_insns()
x86/paravirt: Remove bogus extern declarations
Pull x86 AVX512 status update from Ingo Molnar:
"This adds a new ABI that the main scheduler probably doesn't want to
deal with but HPC job schedulers might want to use: the
AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
file, which allows the user-space job scheduler to cluster such tasks,
to avoid turbo frequency drops"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/filesystems/proc.txt: Add arch_status file
x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
proc: Add /proc/<pid>/arch_status
Pull x86 cleanups from Ingo Molnar:
"Misc small cleanups: removal of superfluous code and coding style
cleanups mostly"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Make variable static and config dependent
x86/defconfigs: Remove useless UEVENT_HELPER_PATH
x86/amd_nb: Make hygon_nb_misc_ids static
x86/tsc: Move inline keyword to the beginning of function declarations
x86/io_delay: Define IO_DELAY macros in C instead of Kconfig
x86/io_delay: Break instead of fallthrough in switch statement
Pull x86 cache resource control update from Ingo Molnar:
"Two cleanup patches"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Cleanup cbm_ensure_valid()
x86/resctrl: Use _ASM_BX to avoid ifdeffery
Pull x86 asm updates from Ingo Molnar:
"Most of the changes relate to Peter Zijlstra's cleanup of ptregs
handling, in particular the i386 part is now much simplified and
standardized - no more partial ptregs stack frames via the esp/ss
oddity. This simplifies ftrace, kprobes, the unwinder, ptrace, kdump
and kgdb.
There's also a CR4 hardening enhancements by Kees Cook, to make the
generic platform functions such as native_write_cr4() less useful as
ROP gadgets that disable SMEP/SMAP. Also protect the WP bit of CR0
against similar attacks.
The rest is smaller cleanups/fixes"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Add int3_emulate_call() selftest
x86/stackframe/32: Allow int3_emulate_push()
x86/stackframe/32: Provide consistent pt_regs
x86/stackframe, x86/ftrace: Add pt_regs frame annotations
x86/stackframe, x86/kprobes: Fix frame pointer annotations
x86/stackframe: Move ENCODE_FRAME_POINTER to asm/frame.h
x86/entry/32: Clean up return from interrupt preemption path
x86/asm: Pin sensitive CR0 bits
x86/asm: Pin sensitive CR4 bits
Documentation/x86: Fix path to entry_32.S
x86/asm: Remove unused TASK_TI_flags from asm-offsets.c
Pull scheduler updates from Ingo Molnar:
- Remove the unused per rq load array and all its infrastructure, by
Dietmar Eggemann.
- Add utilization clamping support by Patrick Bellasi. This is a
refinement of the energy aware scheduling framework with support for
boosting of interactive and capping of background workloads: to make
sure critical GUI threads get maximum frequency ASAP, and to make
sure background processing doesn't unnecessarily move to cpufreq
governor to higher frequencies and less energy efficient CPU modes.
- Add the bare minimum of tracepoints required for LISA EAS regression
testing, by Qais Yousef - which allows automated testing of various
power management features, including energy aware scheduling.
- Restructure the former tsk_nr_cpus_allowed() facility that the -rt
kernel used to modify the scheduler's CPU affinity logic such as
migrate_disable() - introduce the task->cpus_ptr value instead of
taking the address of &task->cpus_allowed directly - by Sebastian
Andrzej Siewior.
- Misc optimizations, fixes, cleanups and small enhancements - see the
Git log for details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched/uclamp: Add uclamp support to energy_compute()
sched/uclamp: Add uclamp_util_with()
sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
sched/uclamp: Set default clamps for RT tasks
sched/uclamp: Reset uclamp values on RESET_ON_FORK
sched/uclamp: Extend sched_setattr() to support utilization clamping
sched/core: Allow sched_setattr() to use the current policy
sched/uclamp: Add system default clamps
sched/uclamp: Enforce last task's UCLAMP_MAX
sched/uclamp: Add bucket local max tracking
sched/uclamp: Add CPU's clamp buckets refcounting
sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
sched/debug: Export the newly added tracepoints
sched/debug: Add sched_overutilized tracepoint
sched/debug: Add new tracepoint to track PELT at se level
sched/debug: Add new tracepoints to track PELT at rq level
sched/debug: Add a new sched_trace_*() helper functions
sched/autogroup: Make autogroup_path() always available
sched/wait: Deduplicate code with do-while
sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
...
Pull RAS updates from Ingo Molnar:
"Boris is on vacation so I'm sending the RAS bits this time. The main
changes were:
- Various RAS/CEC improvements and fixes by Borislav Petkov:
- error insertion fixes
- offlining latency fix
- memory leak fix
- additional sanity checks
- cleanups
- debug output improvements
- More SMCA enhancements by Yazen Ghannam:
- make banks truly per-CPU which they are in the hardware
- don't over-cache certain registers
- make the number of MCA banks per-CPU variable
The long term goal with these changes is to support future
heterogenous SMCA extensions.
- Misc fixes and improvements"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Do not check return value of debugfs_create functions
x86/MCE: Determine MCA banks' init state properly
x86/MCE: Make the number of MCA banks a per-CPU variable
x86/MCE/AMD: Don't cache block addresses on SMCA systems
x86/MCE: Make mce_banks a per-CPU array
x86/MCE: Make struct mce_banks[] static
RAS/CEC: Add copyright
RAS/CEC: Add CONFIG_RAS_CEC_DEBUG and move CEC debug features there
RAS/CEC: Dump the different array element sections
RAS/CEC: Rename count_threshold to action_threshold
RAS/CEC: Sanity-check array on every insertion
RAS/CEC: Fix potential memory leak
RAS/CEC: Do not set decay value on error
RAS/CEC: Check count_threshold unconditionally
RAS/CEC: Fix pfn insertion
Pull locking updates from Ingo Molnar:
"The main changes in this cycle are:
- rwsem scalability improvements, phase #2, by Waiman Long, which are
rather impressive:
"On a 2-socket 40-core 80-thread Skylake system with 40 reader
and writer locking threads, the min/mean/max locking operations
done in a 5-second testing window before the patchset were:
40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255
After the patchset, they became:
40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"
There's a lot of changes to the locking implementation that makes
it similar to qrwlock, including owner handoff for more fair
locking.
Another microbenchmark shows how across the spectrum the
improvements are:
"With a locking microbenchmark running on 5.1 based kernel, the
total locking rates (in kops/s) on a 2-socket Skylake system
with equal numbers of readers and writers (mixed) before and
after this patchset were:
# of Threads Before Patch After Patch
------------ ------------ -----------
2 2,618 4,193
4 1,202 3,726
8 802 3,622
16 729 3,359
32 319 2,826
64 102 2,744"
The changes are extensive and the patch-set has been through
several iterations addressing various locking workloads. There
might be more regressions, but unless they are pathological I
believe we want to use this new implementation as the baseline
going forward.
- jump-label optimizations by Daniel Bristot de Oliveira: the primary
motivation was to remove IPI disturbance of isolated RT-workload
CPUs, which resulted in the implementation of batched jump-label
updates. Beyond the improvement of the real-time characteristics
kernel, in one test this patchset improved static key update
overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
as well.
- atomic64_t cross-arch type cleanups by Mark Rutland: over the last
~10 years of atomic64_t existence the various types used by the
APIs only had to be self-consistent within each architecture -
which means they became wildly inconsistent across architectures.
Mark puts and end to this by reworking all the atomic64
implementations to use 's64' as the base type for atomic64_t, and
to ensure that this type is consistently used for parameters and
return values in the API, avoiding further problems in this area.
- A large set of small improvements to lockdep by Yuyang Du: type
cleanups, output cleanups, function return type and othr cleanups
all around the place.
- A set of percpu ops cleanups and fixes by Peter Zijlstra.
- Misc other changes - please see the Git log for more details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
locking/lockdep: increase size of counters for lockdep statistics
locking/atomics: Use sed(1) instead of non-standard head(1) option
locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
x86/jump_label: Make tp_vec_nr static
x86/percpu: Optimize raw_cpu_xchg()
x86/percpu, sched/fair: Avoid local_clock()
x86/percpu, x86/irq: Relax {set,get}_irq_regs()
x86/percpu: Relax smp_processor_id()
x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
locking/rwsem: Guard against making count negative
locking/rwsem: Adaptive disabling of reader optimistic spinning
locking/rwsem: Enable time-based spinning on reader-owned rwsem
locking/rwsem: Make rwsem->owner an atomic_long_t
locking/rwsem: Enable readers spinning on writer
locking/rwsem: Clarify usage of owner's nonspinaable bit
locking/rwsem: Wake up almost all readers in wait queue
locking/rwsem: More optimal RT task handling of null owner
locking/rwsem: Always release wait_lock before waking up tasks
locking/rwsem: Implement lock handoff to prevent lock starvation
locking/rwsem: Make rwsem_spin_on_owner() return owner state
...
Break out parts of mshyperv.h that are ISA independent into a
separate file in include/asm-generic. This move facilitates
ARM64 code reusing these definitions and avoids code
duplication. No functionality or behavior is changed.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Pull x86 pti updates from Thomas Gleixner:
"The speculative paranoia departement delivers a few more plugs for
possible (probably theoretical) spectre/mds leaks"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tls: Fix possible spectre-v1 in do_get_thread_area()
x86/ptrace: Fix possible spectre-v1 in ptrace_get_debugreg()
x86/speculation/mds: Eliminate leaks by trace_hardirqs_on()
Pull x86 timer updates from Thomas Gleixner:
"A rather large series consolidating the HPET code, which was triggered
by the attempt to bolt HPET NMI watchdog support on to the existing
maze with the usual duct tape and super glue approach.
This mainly removes two separate partially redundant storage layers
and consolidates them into a single one which provides a consistent
view of the different HPET channels and their usage and allows to
integrate HPET NMI watchdog support (if it turns out to be feasible)
in a non intrusive way"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
x86/hpet: Use channel for legacy clockevent storage
x86/hpet: Use common init for legacy clockevent
x86/hpet: Carve out shareable parts of init_one_hpet_msi_clockevent()
x86/hpet: Consolidate clockevent functions
x86/hpet: Wrap legacy clockevent in hpet_channel
x86/hpet: Use cached info instead of extra flags
x86/hpet: Move clockevents into channels
x86/hpet: Rename variables to prepare for switching to channels
x86/hpet: Add function to select a /dev/hpet channel
x86/hpet: Add mode information to struct hpet_channel
x86/hpet: Use cached channel data
x86/hpet: Introduce struct hpet_base and struct hpet_channel
x86/hpet: Coding style cleanup
x86/hpet: Clean up comments
x86/hpet: Make naming consistent
x86/hpet: Remove not required includes
x86/hpet: Decapitalize and rename EVT_TO_HPET_DEV
x86/hpet: Simplify counter validation
x86/hpet: Separate counter check out of clocksource register code
x86/hpet: Shuffle code around for readability sake
...
Pull x86 CPU feature updates from Thomas Gleixner:
"Updates for x86 CPU features:
- Support for UMWAIT/UMONITOR, which allows to use MWAIT and MONITOR
instructions in user space to save power e.g. in HPC workloads
which spin wait on synchronization points.
The maximum time a MWAIT can halt in userspace is controlled by the
kernel and can be adjusted by the sysadmin.
- Speed up the MTRR handling code on CPUs which support cache
self-snooping correctly.
On those CPUs the wbinvd() invocations can be omitted which speeds
up the MTRR setup by a factor of 50.
- Support for the new x86 vendor Zhaoxin who develops processors
based on the VIA Centaur technology.
- Prevent 'cat /proc/cpuinfo' from affecting isolated NOHZ_FULL CPUs
by sending IPIs to retrieve the CPU frequency and use the cached
values instead.
- The addition and late revert of the FSGSBASE support. The revert
was required as it turned out that the code still has hard to
diagnose issues. Yet another engineering trainwreck...
- Small fixes, cleanups, improvements and the usual new Intel CPU
family/model addons"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
x86/fsgsbase: Revert FSGSBASE support
selftests/x86/fsgsbase: Fix some test case bugs
x86/entry/64: Fix and clean up paranoid_exit
x86/entry/64: Don't compile ignore_sysret if 32-bit emulation is enabled
selftests/x86: Test SYSCALL and SYSENTER manually with TF set
x86/mtrr: Skip cache flushes on CPUs with cache self-snooping
x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
Documentation/ABI: Document umwait control sysfs interfaces
x86/umwait: Add sysfs interface to control umwait maximum time
x86/umwait: Add sysfs interface to control umwait C0.2 state
x86/umwait: Initialize umwait control values
x86/cpufeatures: Enumerate user wait instructions
x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs
x86/acpi/cstate: Add Zhaoxin processors support for cache flush policy in C3
ACPI, x86: Add Zhaoxin processors support for NONSTOP TSC
x86/cpu: Create Zhaoxin processors architecture support file
x86/cpu: Split Tremont based Atoms from the rest
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
...
Pull x86 FPU updates from Thomas Gleixner:
"A small set of updates for the FPU code:
- Make the no387/nofxsr command line options useful by restricting
them to 32bit and actually clearing all dependencies to prevent
random crashes and malfunction.
- Simplify and cleanup the kernel_fpu_*() helpers"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Inline fpu__xstate_clear_all_cpu_caps()
x86/fpu: Make 'no387' and 'nofxsr' command line options useful
x86/fpu: Remove the fpu__save() export
x86/fpu: Simplify kernel_fpu_begin()
x86/fpu: Simplify kernel_fpu_end()
Pull x86 vsyscall updates from Thomas Gleixner:
"Further hardening of the legacy vsyscall by providing support for
execute only mode and switching the default to it.
This prevents a certain class of attacks which rely on the vsyscall
page being accessible at a fixed address in the canonical kernel
address space"
* 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86: Add a test for process_vm_readv() on the vsyscall page
x86/vsyscall: Add __ro_after_init to global variables
x86/vsyscall: Change the default vsyscall mode to xonly
selftests/x86/vsyscall: Verify that vsyscall=none blocks execution
x86/vsyscall: Document odd SIGSEGV error code for vsyscalls
x86/vsyscall: Show something useful on a read fault
x86/vsyscall: Add a new vsyscall=xonly mode
Documentation/admin: Remove the vsyscall=native documentation
Pull x96 apic updates from Thomas Gleixner:
"Updates for the x86 APIC interrupt handling and APIC timer:
- Fix a long standing issue with spurious interrupts which was caused
by the big vector management rework a few years ago. Robert Hodaszi
provided finally enough debug data and an excellent initial failure
analysis which allowed to understand the underlying issues.
This contains a change to the core interrupt management code which
is required to handle this correctly for the APIC/IO_APIC. The core
changes are NOOPs for most architectures except ARM64. ARM64 is not
impacted by the change as confirmed by Marc Zyngier.
- Newer systems allow to disable the PIT clock for power saving
causing panic in the timer interrupt delivery check of the IO/APIC
when the HPET timer is not enabled either. While the clock could be
turned on this would cause an endless whack a mole game to chase
the proper register in each affected chipset.
These systems provide the relevant frequencies for TSC, CPU and the
local APIC timer via CPUID and/or MSRs, which allows to avoid the
PIT/HPET based calibration. As the calibration code is the only
usage of the legacy timers on modern systems and is skipped anyway
when the frequencies are known already, there is no point in
setting up the PIT and actually checking for the interrupt delivery
via IO/APIC.
To achieve this on a wide variety of platforms, the CPUID/MSR based
frequency readout has been made more robust, which also allowed to
remove quite some workarounds which turned out to be not longer
required. Thanks to Daniel Drake for analysis, patches and
verification"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Seperate unused system vectors from spurious entry again
x86/irq: Handle spurious interrupt after shutdown gracefully
x86/ioapic: Implement irq_get_irqchip_state() callback
genirq: Add optional hardware synchronization for shutdown
genirq: Fix misleading synchronize_irq() documentation
genirq: Delay deactivation in free_irq()
x86/timer: Skip PIT initialization on modern chipsets
x86/apic: Use non-atomic operations when possible
x86/apic: Make apic_bsp_setup() static
x86/tsc: Set LAPIC timer period to crystal clock frequency
x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
Pull timer updates from Thomas Gleixner:
"The timer and timekeeping departement delivers:
Core:
- The consolidation of the VDSO code into a generic library including
the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
route through the relevant maintainer trees and should end up in
5.4.
This gets rid of the unnecessary different copies of the same code
and brings all architectures on the same level of VDSO
functionality.
- Make the NTP user space interface more robust by restricting the
TAI offset to prevent undefined behaviour. Includes a selftest.
- Validate user input in the compat settimeofday() syscall to catch
invalid values which would be turned into valid values by a
multiplication overflow
- Consolidate the time accessors
- Small fixes, improvements and cleanups all over the place
Drivers:
- Support for the NXP system counter, TI davinci timer
- Move the Microsoft HyperV clocksource/events code into the
drivers/clocksource directory so it can be shared between x86 and
ARM64.
- Overhaul of the Tegra driver
- Delay timer support for IXP4xx
- Small fixes, improvements and cleanups as usual"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
time: Validate user input in compat_settimeofday()
timer: Document TIMER_PINNED
clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
clocksource/drivers: Make Hyper-V clocksource ISA agnostic
MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
hrtimer: Use a bullet for the returns bullet list
arm64: vdso: Fix compilation with clang older than 8
arm64: compat: Fix __arch_get_hw_counter() implementation
arm64: Fix __arch_get_hw_counter() implementation
lib/vdso: Make delta calculation work correctly
MAINTAINERS: Add entry for the generic VDSO library
arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
arm64: vdso: Remove unnecessary asm-offsets.c definitions
vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
clocksource/drivers/davinci: Add support for clocksource
clocksource/drivers/davinci: Add support for clockevents
clocksource/drivers/tegra: Set up maximum-ticks limit properly
clocksource/drivers/tegra: Cycles can't be 0
clocksource/drivers/tegra: Restore base address before cleanup
clocksource/drivers/tegra: Add verbose definition for 1MHz constant
...
Pull SMP/hotplug updates from Thomas Gleixner:
"A small set of updates for SMP and CPU hotplug:
- Abort disabling secondary CPUs in the freezer when a wakeup is
pending instead of evaluating it only after all CPUs have been
offlined.
- Remove the shared annotation for the strict per CPU cfd_data in the
smp function call core code.
- Remove the return values of smp_call_function() and on_each_cpu()
as they are unconditionally 0. Fixup the few callers which actually
bothered to check the return value"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove smp_call_function() and on_each_cpu() return values
smp: Do not mark call_function_data as shared
cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
Pull arm64 updates from Catalin Marinas:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
- Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
manage the permissions of executable vmalloc regions more strictly
- Slight performance improvement by keeping softirqs enabled while
touching the FPSIMD/SVE state (kernel_neon_begin/end)
- Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
XAFLAG and AXFLAG instructions for floating point comparison flags
manipulation) and FRINT (rounding floating point numbers to integers)
- Re-instate ARM64_PSEUDO_NMI support which was previously marked as
BROKEN due to some bugs (now fixed)
- Improve parking of stopped CPUs and implement an arm64-specific
panic_smp_self_stop() to avoid warning on not being able to stop
secondary CPUs during panic
- perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
platforms
- perf: DDR performance monitor support for iMX8QXP
- cache_line_size() can now be set from DT or ACPI/PPTT if provided to
cope with a system cache info not exposed via the CPUID registers
- Avoid warning on hardware cache line size greater than
ARCH_DMA_MINALIGN if the system is fully coherent
- arm64 do_page_fault() and hugetlb cleanups
- Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
- Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
'arm_boot_flags' introduced in 5.1)
- CONFIG_RANDOMIZE_BASE now enabled in defconfig
- Allow the selection of ARM64_MODULE_PLTS, currently only done via
RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
over into the vmalloc area
- Make ZONE_DMA32 configurable
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
perf: arm_spe: Enable ACPI/Platform automatic module loading
arm_pmu: acpi: spe: Add initial MADT/SPE probing
ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
ACPI/PPTT: Modify node flag detection to find last IDENTICAL
x86/entry: Simplify _TIF_SYSCALL_EMU handling
arm64: rename dump_instr as dump_kernel_instr
arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
arm64: Implement panic_smp_self_stop()
arm64: Improve parking of stopped CPUs
arm64: Expose FRINT capabilities to userspace
arm64: Expose ARMv8.5 CondM capability to userspace
arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
arm64: ARM64_MODULES_PLTS must depend on MODULES
arm64: bpf: do not allocate executable memory
arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
arm64: module: create module allocations without exec permissions
arm64: Allow user selection of ARM64_MODULE_PLTS
acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
arm64: Allow selecting Pseudo-NMI again
...
The command line option `no387' is designed to disable the FPU
entirely. This only 'works' with CONFIG_MATH_EMULATION enabled.
But on 64bit this cannot work because user space expects SSE to work which
required basic FPU support. MATH_EMULATION does not help because SSE is not
emulated.
The command line option `nofxsr' should also be limited to 32bit because
FXSR is part of the required flags on 64bit so turning it off is not
possible.
Clearing X86_FEATURE_FPU without emulation enabled will not work anyway and
hang in fpu__init_system_early_generic() before the console is enabled.
Setting additioal dependencies, ensures that the CPU still boots on a
modern CPU. Otherwise, dropping FPU will leave FXSR enabled causing the
kernel to crash early in fpu__init_system_mxcsr().
With XSAVE support it will crash in fpu__init_cpu_xstate(). The problem is
that xsetbv() with XMM set and SSE cleared is not allowed. That means
XSAVE has to be disabled. The XSAVE support is disabled in
fpu__init_system_xstate_size_legacy() but it is too late. It can be
removed, it has been added in commit
1f999ab5a1 ("x86, xsave: Disable xsave in i387 emulation mode")
to use `no387' on a CPU with XSAVE support.
All this happens before console output.
After hat, the next possible crash is in RAID6 detect code because MMX
remained enabled. With a 3DNOW enabled config it will explode in memcpy()
for instance due to kernel_fpu_begin() but this is unconditionally enabled.
This is enough to boot a Debian Wheezy on a 32bit qemu "host" CPU which
supports everything up to XSAVES, AVX2 without 3DNOW. Later, Debian
increased the minimum requirements to i686 which means it does not boot
userland atleast due to CMOV.
After masking the additional features it still keeps SSE4A and 3DNOW*
enabled (if present on the host) but those are unused in the kernel.
Restrict `no387' and `nofxsr' otions to 32bit only. Add dependencies for
FPU, FXSR to additionaly mask CMOV, MMX, XSAVE if FXSR or FPU is cleared.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190703083247.57kjrmlxkai3vpw3@linutronix.de
Pull kvm fixes from Paolo Bonzini:
"x86 bugfix patches and one compilation fix for ARM"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64/sve: Fix vq_present() macro to yield a bool
KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC
KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS
KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM
KVM: x86: degrade WARN to pr_warn_ratelimited
Replace a magic 64-bit mask with a list of valid registers, computing
the same mask in the end.
Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm-unit-tests were adjusted to match bare metal behavior, but KVM
itself was not doing what bare metal does; fix that.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on Host Segment and Descriptor-Table
Registers" in Intel SDM vol 3C, the following checks are performed on
vmentry of nested guests:
- In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the
RPL (bits 1:0) and the TI flag (bit 2) must be 0.
- The selector fields for CS and TR cannot be 0000H.
- The selector field for SS cannot be 0000H if the "host address-space
size" VM-exit control is 0.
- On processors that support Intel 64 architecture, the base-address
fields for FS, GS and TR must contain canonical addresses.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM does not have 100% coverage of VMX consistency checks, i.e. some
checks that cause VM-Fail may only be detected by hardware during a
nested VM-Entry. In such a case, KVM must restore L1's state to the
pre-VM-Enter state as L2's state has already been loaded into KVM's
software model.
L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But
when EPT is disabled, the associated fields hold KVM's shadow values,
not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs
come from memory, i.e. are not cached in the VMCS. Which leaves CR3
as the sole anomaly.
A previously applied workaround to handle CR3 was to force nested early
checks if EPT is disabled:
commit 2b27924bb1 ("KVM: nVMX: always use early vmcs check when EPT
is disabled")
Forcing nested early checks is undesirable as doing so adds hundreds of
cycles to every nested VM-Entry. Rather than take this performance hit,
handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
VM-Entry when EPT is disabled *and* nested early checks are disabled.
By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
These shenanigans work because nested_vmx_restore_host_state() does a
full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
prior to re-entering L1.
vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
nested_vmx_restore_host_state() ->
kvm_mmu_reset_context() ->
kvm_mmu_unload() ->
kvm_mmu_free_roots()
kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
on 'root_hpa == INVALID_PAGE' unless the implementation of
kvm_mmu_reset_context() is changed.
On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
successful entry) via:
vcpu_enter_guest() ->
kvm_mmu_reload() ->
kvm_mmu_load() ->
kvm_mmu_load_cr3() ->
vmx_set_cr3()
Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled
as a "late" VM-Fail should never happen win that case (KVM WARNs), and
the conditional write avoids the need to restore the correct GUEST_CR3
when nested_vmx_check_vmentry_hw() fails.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Note that in such a case it is quite likely that KVM will BUG_ON
in __pte_list_remove when the VM is closed. However, there is no
immediate risk of memory corruption in the host so a WARN_ON is
enough and it lets you gather traces for debugging.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After the previous patch, the low bits of the gfn are masked in
both FNAME(fetch) and __direct_map, so we do not need to clear them
in transparent_hugepage_adjust.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These two functions are basically doing the same thing through
kvm_mmu_get_page, link_shadow_page and mmu_set_spte; yet, for historical
reasons, their code looks very different. This patch tries to take the
best of each and make them very similar, so that it is easy to understand
changes that apply to both of them.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>