Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to
avoid clobbering flags.
KVM's emulator makes indirect calls into a jump table of sorts, where
the destination of the CALL_NOSPEC is a small blob of code that performs
fast emulation by executing the target instruction with fixed operands.
adcb_al_dl:
0x000339f8 <+0>: adc %dl,%al
0x000339fa <+2>: ret
A major motivation for doing fast emulation is to leverage the CPU to
handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
both an input and output to the target of CALL_NOSPEC. Clobbering flags
results in all sorts of incorrect emulation, e.g. Jcc instructions often
take the wrong path. Sans the nops...
asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n"
0x0003595a <+58>: mov 0xc0(%ebx),%eax
0x00035960 <+64>: mov 0x60(%ebx),%edx
0x00035963 <+67>: mov 0x90(%ebx),%ecx
0x00035969 <+73>: push %edi
0x0003596a <+74>: popf
0x0003596b <+75>: call *%esi
0x000359a0 <+128>: pushf
0x000359a1 <+129>: pop %edi
0x000359a2 <+130>: mov %eax,0xc0(%ebx)
0x000359b1 <+145>: mov %edx,0x60(%ebx)
ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
0x000359a8 <+136>: mov -0x10(%ebp),%eax
0x000359ab <+139>: and $0x8d5,%edi
0x000359b4 <+148>: and $0xfffff72a,%eax
0x000359b9 <+153>: or %eax,%edi
0x000359bd <+157>: mov %edi,0x4(%ebx)
For the most part this has gone unnoticed as emulation of guest code
that can trigger fast emulation is effectively limited to MMIO when
running on modern hardware, and MMIO is rarely, if ever, accessed by
instructions that affect or consume flags.
Breakage is almost instantaneous when running with unrestricted guest
disabled, in which case KVM must emulate all instructions when the guest
has invalid state, e.g. when the guest is in Big Real Mode during early
BIOS.
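For reference, the flag-safe adjustment boils down to a one-instruction
difference in how the 32-bit thunk pops its return address; the snippet
below illustrates the idea and is not the exact CALL_NOSPEC macro:

  addl  $4, %esp           # old: adjusts the stack but clobbers EFLAGS
  leal  4(%esp), %esp      # new: same adjustment, arithmetic flags untouched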
Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support")
Fixes: 1a29b5b7f3 ("KVM: x86: Make indirect calls in emulator speculation safe")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com
There is no particular reason not to enable the TSC page clocksource on
32-bit. mul_u64_u64_shr() is available and, despite the increased
computational complexity (compared to 64-bit), the TSC page is still a huge
win compared to the MSR-based clocksource.
In-kernel reads:
MSR based clocksource: 3361 cycles
TSC page clocksource: 49 cycles
Reads from userspace (utilizing vDSO in case of TSC page):
MSR based clocksource: 5664 cycles
TSC page clocksource: 131 cycles
Enabling the TSC page on 32-bit allows us to get rid of CONFIG_HYPERV_TSCPAGE,
as it is now no different from CONFIG_HYPERV_TIMER.
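For context, the conversion that mul_u64_u64_shr() enables is the usual TSC
page formula; a minimal sketch follows (tsc_page_to_ref_count() is an
illustrative helper name, and the real driver also rereads the page under a
sequence count):

  /* Illustrative only: scale and offset come from the Hyper-V TSC page. */
  static inline u64 tsc_page_to_ref_count(u64 tsc, u64 scale, s64 offset)
  {
          /* (tsc * scale) >> 64 without requiring native 128-bit math */
          return mul_u64_u64_shr(tsc, scale, 64) + offset;
  }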
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20190822083630.17059-1-vkuznets@redhat.com
Hyper-V guests use the default native_sched_clock() in
pv_ops.time.sched_clock on x86. But native_sched_clock() directly uses the
raw TSC value, which can be discontinuous in a Hyper-V VM.
Add the generic hv_setup_sched_clock() to set the sched clock function
appropriately. On x86, this sets pv_ops.time.sched_clock to read the
Hyper-V reference TSC value that is scaled and adjusted to be continuous.
Also move the Hyper-V reference TSC initialization much earlier in the boot
process so no discontinuity is observed when pv_ops.time.sched_clock
calculates its offset.
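A rough sketch of what the x86 side of the hook amounts to is shown below;
hv_read_reference_time() and hv_sched_clock_offset are illustrative names
standing in for the real TSC page reader and the boot-time offset:

  static u64 notrace read_hv_sched_clock(void)
  {
          /* scaled, continuous reference time minus the boot-time offset */
          return hv_read_reference_time() - hv_sched_clock_offset;
  }

  static __init void hv_setup_sched_clock(void *sched_clock)
  {
          pv_ops.time.sched_clock = sched_clock;
  }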
[ tglx: Folded build fix ]
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20190814123216.32245-3-Tianyu.Lan@microsoft.com
Prepare to add Hyper-V sched clock callback and move Hyper-V Reference TSC
initialization much earlier in the boot process. Earlier initialization is
needed so that it happens while the timestamp value is still 0 and no
discontinuity in the timestamp will occur when pv_ops.time.sched_clock
calculates its offset.
The earlier initialization requires that the Hyper-V TSC page be allocated
statically instead of with vmalloc(), so fixup the references to the TSC
page and the method of getting its physical address.
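The allocation change amounts to roughly the following sketch (simplified;
tsc_pg and phys_addr are illustrative names):

  /* Statically allocated so it is usable before vmalloc() is up. */
  static struct ms_hyperv_tsc_page tsc_pg __aligned(PAGE_SIZE);

  struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
  {
          return &tsc_pg;
  }

  /* Physical address now comes from __pa() instead of vmalloc_to_pfn(). */
  phys_addr = __pa(hv_get_tsc_page());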
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lkml.kernel.org/r/20190814123216.32245-2-Tianyu.Lan@microsoft.com
This variable has no users anymore. Remove it and inform the IOMMU code of
the requested DMA modes via its new functions.
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Pull RCU and LKMM changes from Paul E. McKenney:
- A few more RCU flavor consolidation cleanups.
- Miscellaneous fixes.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Torture-test updates.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention
on ->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- LKMM updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix an incorrect/stale comment regarding the vmx_vcpu pointer, as guest
registers are now loaded using a direct pointer to the start of the
register array.
Opportunistically add a comment to document why the vmx_vcpu pointer is
needed, as its consumption via 'call vmx_update_host_rsp' is rather subtle.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM implementations that wrap struct kvm_vcpu with a vendor specific
struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
will instead overlap random chunks of the vendor specific struct.
E.g. padding a large number of bytes before struct kvm_vcpu triggers
a usercopy warning when running with CONFIG_HARDENED_USERCOPY=y.
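One way to catch violations at build time, if desired, is an assertion in
each vendor module; this is an illustrative check, not necessarily the exact
one added:

  /* struct kvm_vcpu must be the first member so that the usercopy
   * region whitelisted for kvm_vcpu_arch covers the intended bytes.
   */
  BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);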
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove a few stale checks for non-NULL ops now that the ops in question
are implemented by both VMX and SVM.
Note, this is **not** stable material, the Fixes tags are there purely
to show when a particular op was first supported by both VMX and SVM.
Fixes: 74f169090b ("kvm/svm: Setup MCG_CAP on AMD properly")
Fixes: b31c114b82 ("KVM: X86: Provide a capability to disable PAUSE intercepts")
Fixes: 411b44ba80 ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the open-coded "is MMIO SPTE" checks in the MMU warnings
related to software-based access/dirty tracking to make the code
slightly more self-documenting.
No functional change intended.
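The check in question is roughly the existing MMIO-SPTE predicate:

  static bool is_mmio_spte(u64 spte)
  {
          return (spte & shadow_mmio_mask) == shadow_mmio_value;
  }

  /* e.g. in the access/dirty tracking warnings: */
  WARN_ON(is_mmio_spte(spte));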
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When shadow paging is enabled, KVM tracks the allowed access type for
MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
without having to walk the guest's page tables. The tracking is done
by retaining the WRITE and USER bits of the access when inserting the
MMIO SPTE (read access is implicitly allowed), which allows the MMIO
page fault handler to retrieve and cache the WRITE/USER bits from the
SPTE.
Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
hardcoded using the x86 paging versions of the bits. This funkiness
happens to work because KVM uses a completely different mask/value for
MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
overlap exactly with the x86 WRITE/USER bits[*].
Explicitly define the access mask for MMIO SPTEs to accurately reflect
that EPT does not want to incorporate any access bits into the SPTE, and
so that KVM isn't subtly relying on EPT's WX bits always being set in
MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
horribly.
Note, vcpu_match_mmio_gva() explicitly prevents matching GVA==0, and all
TDP flows explicitly set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
EPT has no (known) functional impact.
[*] Using WX to generate EPT misconfigurations (equivalent to reserved
bit page fault) ensures KVM can employ its MMIO page fault tricks
even on platforms without reserved address bits.
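Conceptually, the access bits become an explicit, separately tracked mask
rather than being derived from the x86 WRITE/USER bits. A simplified sketch
(the real function also sanity checks the mask/value pair):

  /* EPT passes 0 so no access bits are folded into MMIO SPTEs;
   * shadow paging passes ACC_WRITE_MASK | ACC_USER_MASK as before.
   */
  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
  {
          shadow_mmio_mask = mmio_mask;
          shadow_mmio_value = mmio_value;
          shadow_mmio_access_mask = access_mask;
  }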
Fixes: ce88decffd ("KVM: MMU: mmio page fault support")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "access" to "mmio_access" to match the other MMIO cache members
and to make it more obvious that it's tracking the access permissions
for the MMIO cache.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just like we do with other intercepts, in vmrun_interception() we should be
doing kvm_skip_emulated_instruction() and not just RIP += 3. Also, it is
wrong to increment RIP before nested_svm_vmrun() as it can result in
kvm_inject_gp().
We can't call kvm_skip_emulated_instruction() after nested_svm_vmrun() so
move it inside.
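The intercept then reduces to something like this sketch:

  static int vmrun_interception(struct vcpu_svm *svm)
  {
          if (nested_svm_check_permissions(svm))
                  return 1;

          /* RIP is advanced (or not) inside nested_svm_vmrun() now */
          return nested_svm_vmrun(svm);
  }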
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Regardless of whether or not nested_svm_vmrun_msrpm() fails, we return 1
from vmrun_interception() so there's no point in doing goto. Also,
nested_svm_vmrun_msrpm() call can be made from nested_svm_vmrun() where
other nested launch issues are handled.
nested_svm_vmrun() returns a bool, however, its result is ignored in
vmrun_interception() as we always return '1'. As a preparatory change
to putting kvm_skip_emulated_instruction() inside nested_svm_vmrun()
make nested_svm_vmrun() return an int (always '1' for now).
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Various intercepts hard-code the respective instruction lengths to optimize
skip_emulated_instruction(): when next_rip is pre-set we skip
kvm_emulate_instruction(vcpu, EMULTYPE_SKIP). The optimization is, however,
incorrect: different (redundant) prefixes could be used to enlarge the
instruction. We can't really avoid decoding.
svm->next_rip is not used when the CPU supports the 'nrips' (X86_FEATURE_NRIPS)
feature: the next RIP is provided in the VMCB. The feature is not really new
(Opteron G3s had it already) and the change should have zero effect.
Remove manual svm->next_rip setting with hard-coded instruction lengths.
The only case where we now use svm->next_rip is EXIT_IOIO: the instruction
length is provided to us by hardware.
Hardcoded RIP advancement remains in vmrun_interception(); this is going to
be taken care of separately.
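The skip logic that remains prefers the hardware-provided next RIP and only
falls back to decoding; a rough sketch of SVM's skip_emulated_instruction():

  if (nrips && svm->vmcb->control.next_rip != 0)
          kvm_rip_write(vcpu, svm->vmcb->control.next_rip);
  else
          kvm_emulate_instruction(vcpu, EMULTYPE_SKIP);  /* decode, no shortcuts */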
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To avoid hardcoding xsetbv length to '3' we need to support decoding it in
the emulator.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When doing x86_emulate_instruction(EMULTYPE_SKIP), the interrupt shadow has to
be cleared if and only if the skipping is successful.
There are two immediate issues:
- In SVM skip_emulated_instruction() we are not zapping interrupt shadow
in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP
(!nrpip_save).
- In VMX handle_ept_misconfig() when running as a nested hypervisor we
(static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt
shadow.
Note that we intentionally don't handle the case when the skipped
instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as
skip-emulation of those instructions should not happen under normal
circumstances.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
Currently, we only do printk(KERN_DEBUG) when this happens and this
is not ideal. Propagate the error up the stack.
On VMX, skip_emulated_instruction() doesn't fail; we have two call
sites calling it explicitly, handle_exception_nmi() and
handle_task_switch(), and we can just ignore the result.
On SVM, we also have two explicit call sites: svm_queue_exception(), where it
seems we don't need to do anything as we check whether RIP was advanced or
not, and task_switch_interception(), where we are better off not proceeding
to kvm_task_switch() in case skip_emulated_instruction() failed.
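E.g. for the task switch intercept the idea is simply (sketch):

  /* bail out instead of emulating a task switch on top of a failed skip */
  if (!skip_emulated_instruction(&svm->vcpu))
          return 0;
  /* ... only then hand off to kvm_task_switch() ... */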
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm->next_rip is only used by skip_emulated_instruction(), and in case
kvm_set_msr() fails we rightfully don't do that. Move the svm->next_rip
advancement to the 'else' branch to avoid creating the false impression that
it's always advanced (and to make it look like rdmsr_interception()).
This is a preparatory change to removing hardcoded RIP advancement
from instruction intercepts, no functional change.
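I.e. the write path ends up mirroring rdmsr_interception(), roughly:

  if (kvm_set_msr(&svm->vcpu, &msr)) {
          trace_kvm_msr_write_ex(ecx, data);
          kvm_inject_gp(&svm->vcpu, 0);
  } else {
          trace_kvm_msr_write(ecx, data);
          svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;  /* only on success */
          skip_emulated_instruction(&svm->vcpu);
  }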
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jump to the common error handling in x86_decode_insn() if
__do_insn_fetch_bytes() fails so that its error code is converted to the
appropriate return type. Although the various helpers used by
x86_decode_insn() return X86EMUL_* values, x86_decode_insn() itself
returns EMULATION_FAILED or EMULATION_OK.
This doesn't cause a functional issue as the sole caller,
x86_emulate_instruction(), currently only cares about success vs.
failure, and success is indicated by '0' for both types
(X86EMUL_CONTINUE and EMULATION_OK).
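The change is essentially this, sketching the relevant fragment of
x86_decode_insn():

  rc = __do_insn_fetch_bytes(ctxt, 1);
  if (rc != X86EMUL_CONTINUE)
          goto done;   /* was: return rc, leaking an X86EMUL_* value */

  /* ... */

  done:
          return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;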
Fixes: 285ca9e948 ("KVM: emulate: speed up do_insn_fetch")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to AMD bits, set the Intel bits from the vendor-independent
feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care
about the vendor and they should be set on AMD processors as well.
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even though it is preferable to use SPEC_CTRL (represented by
X86_FEATURE_AMD_SSBD) instead of VIRT_SPEC, VIRT_SPEC is always
supported anyway because otherwise it would be impossible to
migrate from old to new CPUs. Make this apparent in the
result of KVM_GET_SUPPORTED_CPUID as well.
However, we need to hide the bit on Intel processors, so move
the setting to svm_set_supported_cpuid.
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The AMD_* bits have to be set from the vendor-independent
feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care
about the vendor and they should be set on Intel processors as well.
On top of this, the SSBD, STIBP and AMD_SSB_NO bits were not set, and
VIRT_SSBD does not have to be added manually because it is a
cpufeature that comes directly from the host's CPUID bit.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before this commit lib/crypto/sha256.c was only used in the s390 and
x86 purgatory code. Make it suitable for generic use:
* Export interesting symbols
* Add -D__DISABLE_EXPORTS to CFLAGS_sha256.o for purgatory builds to
avoid the exports for the purgatory builds
* Add to lib/crypto/Makefile and crypto/Kconfig
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Generic crypto implementations belong under lib/crypto not directly in
lib, likewise the header should be in include/crypto, not include/linux.
Note that the code in lib/crypto/sha256.c is not yet available for
generic use after this commit; it is still only used by the s390 and x86
purgatory code. Making it suitable for generic use is done in further
patches in this series.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Align the x86 code with the generic XTS template, which now supports
ciphertext stealing as described by the IEEE XTS-AES spec P1619.
Tested-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Another one for the cipher museum: split off DES core processing into
a separate module so other drivers (mostly for crypto accelerators)
can reuse the code without pulling in the generic DES cipher itself.
This will also permit the cipher interface to be made private to the
crypto API itself once we move the only user in the kernel (CIFS) to
this library interface.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation of moving the shared key expansion routine into the
DES library, move the verification done by __des3_ede_setkey() into
its callers.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull KVM fixes from Paolo Bonzini:
"A couple bugfixes, and mostly selftests changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
selftests/kvm: make platform_info_test pass on AMD
Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
selftests: kvm: fix state save/load on processors without XSAVE
selftests: kvm: fix vmx_set_nested_state_test
selftests: kvm: provide common function to enable eVMCS
selftests: kvm: do not try running the VM in vmx_set_nested_state_test
KVM: x86: svm: remove redundant assignment of var new_entry
MAINTAINERS: add KVM x86 reviewers
MAINTAINERS: change list for KVM/s390
kvm: x86: skip populating logical dest map if apic is not sw enabled
Add CONFIG_ASM_MODVERSIONS. This allows removing one level of if-conditional
nesting in scripts/Makefile.build.
scripts/Makefile.build is run every time Kbuild descends into a
sub-directory. So, I want to avoid $(wildcard ...) evaluation
where possible, although computing $(wildcard ...) is so cheap that
it may not make a measurable performance difference.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
This reverts commit 4e103134b8.
Alex Williamson reported regressions with device assignment with
this patch. Even though the bug is probably elsewhere and still
latent, this is needed to fix the regression.
Fixes: 4e103134b8 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05)
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The testmmiotrace module shouldn't be permitted when the kernel is locked
down as it can be used to arbitrarily read and write MMIO space. This is
a runtime check rather than a build-time one in order to allow configurations
where the same kernel may be run in both locked-down and permissive modes
depending on local policy.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Ingo Molnar <mingo@kernel.org>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: x86@kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
This option allows userspace to pass the RSDP address to the kernel, which
makes it possible for a user to modify the workings of hardware. Reject
the option when the kernel is locked down. This requires some reworking
of the existing RSDP command line logic, since the early boot code also
makes use of a command-line passed RSDP when locating the SRAT table
before the lockdown code has been initialised. This is achieved by
separating the command line RSDP path in the early boot code from the
generic RSDP path, and then copying the command line RSDP into boot
params in the kernel proper if lockdown is not enabled. If lockdown is
enabled and an RSDP is provided on the command line, this will only be
used when parsing SRAT (which shouldn't permit kernel code execution)
and will be ignored in the rest of the kernel.
(Modified by Matthew Garrett in order to handle the early boot RSDP
environment)
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: Dave Young <dyoung@redhat.com>
cc: linux-acpi@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
Writing to MSRs should not be allowed if the kernel is locked down, since
it could lead to execution of arbitrary code in kernel mode. Based on a
patch by Kees Cook.
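The enforcement is just an additional lockdown check in the MSR write paths,
something like this sketch:

  /* in the /dev/cpu/<n>/msr write path */
  err = security_locked_down(LOCKDOWN_MSR);
  if (err)
          return err;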
Signed-off-by: Matthew Garrett <mjg59@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
cc: x86@kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
IO port access would permit users to gain access to PCI configuration
registers, which in turn (on a lot of hardware) give access to MMIO
register space. This would potentially permit root to trigger arbitrary
DMA, so lock it down by default.
This also implicitly locks down the KDADDIO, KDDELIO, KDENABIO and
KDDISABIO console ioctls.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: x86@kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
This is a preparatory patch for kexec_file_load() lockdown. A locked down
kernel needs to prevent unsigned kernel images from being loaded with
kexec_file_load(). Currently, the only way to force the signature
verification is compiling with KEXEC_VERIFY_SIG. This prevents loading
unsigned images even when the kernel is not locked down at runtime.
This patch splits KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE.
Analogous to the MODULE_SIG and MODULE_SIG_FORCE for modules, KEXEC_SIG
turns on the signature verification but allows unsigned images to be
loaded. KEXEC_SIG_FORCE disallows images without a valid signature.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
A kexec reboot with secure boot enabled does not keep the secure boot mode in
the new kernel, so one can later load an unsigned kernel via the legacy
kexec_load. In this state, the system is missing the protections provided
by secure boot.
Fix this by retaining the secure_boot flag from the original kernel.
The secure_boot flag in boot_params is set in the EFI stub, but kexec bypasses
the stub. Fix this issue by copying the secure_boot flag across the kexec
reboot.
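The fix itself is a one-liner in the kexec boot parameter setup, along the
lines of:

  /* preserve the EFI-stub-provided secure boot state across kexec */
  params->secure_boot = boot_params.secure_boot;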
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
Both the 64-bit and the 32-bit handle_irq() implementations check the irq
descriptor pointer with IS_ERR_OR_NULL() and return failure. That can be
done more simply in the common do_IRQ() code.
This reduces the 64bit handle_irq() function to a wrapper around
generic_handle_irq_desc(). Invoke it directly from do_IRQ() to spare the
extra function call.
[ tglx: Got rid of the #ifdef and massaged changelog ]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/2ec758c7-9aaa-73ab-f083-cc44c86aa741@gmail.com
There are 2 problems with the old iosf PMIC I2C bus arbitration code which
need to be addressed:
1. The lockdep code complains about a possible deadlock in the
iosf_mbi_[un]block_punit_i2c_access code:
[ 6.712662] ======================================================
[ 6.712673] WARNING: possible circular locking dependency detected
[ 6.712685] 5.3.0-rc2+ #79 Not tainted
[ 6.712692] ------------------------------------------------------
[ 6.712702] kworker/0:1/7 is trying to acquire lock:
[ 6.712712] 00000000df1c5681 (iosf_mbi_block_punit_i2c_access_count_mutex){+.+.}, at: iosf_mbi_unblock_punit_i2c_access+0x13/0x90
[ 6.712739]
but task is already holding lock:
[ 6.712749] 0000000067cb23e7 (iosf_mbi_punit_mutex){+.+.}, at: iosf_mbi_block_punit_i2c_access+0x97/0x186
[ 6.712768]
which lock already depends on the new lock.
[ 6.712780]
the existing dependency chain (in reverse order) is:
[ 6.712792]
-> #1 (iosf_mbi_punit_mutex){+.+.}:
[ 6.712808] __mutex_lock+0xa8/0x9a0
[ 6.712818] iosf_mbi_block_punit_i2c_access+0x97/0x186
[ 6.712831] i2c_dw_acquire_lock+0x20/0x30
[ 6.712841] i2c_dw_set_reg_access+0x15/0xb0
[ 6.712851] i2c_dw_probe+0x57/0x473
[ 6.712861] dw_i2c_plat_probe+0x33e/0x640
[ 6.712874] platform_drv_probe+0x38/0x80
[ 6.712884] really_probe+0xf3/0x380
[ 6.712894] driver_probe_device+0x59/0xd0
[ 6.712905] bus_for_each_drv+0x84/0xd0
[ 6.712915] __device_attach+0xe4/0x170
[ 6.712925] bus_probe_device+0x9f/0xb0
[ 6.712935] deferred_probe_work_func+0x79/0xd0
[ 6.712946] process_one_work+0x234/0x560
[ 6.712957] worker_thread+0x50/0x3b0
[ 6.712967] kthread+0x10a/0x140
[ 6.712977] ret_from_fork+0x3a/0x50
[ 6.712986]
-> #0 (iosf_mbi_block_punit_i2c_access_count_mutex){+.+.}:
[ 6.713004] __lock_acquire+0xe07/0x1930
[ 6.713015] lock_acquire+0x9d/0x1a0
[ 6.713025] __mutex_lock+0xa8/0x9a0
[ 6.713035] iosf_mbi_unblock_punit_i2c_access+0x13/0x90
[ 6.713047] i2c_dw_set_reg_access+0x4d/0xb0
[ 6.713058] i2c_dw_probe+0x57/0x473
[ 6.713068] dw_i2c_plat_probe+0x33e/0x640
[ 6.713079] platform_drv_probe+0x38/0x80
[ 6.713089] really_probe+0xf3/0x380
[ 6.713099] driver_probe_device+0x59/0xd0
[ 6.713109] bus_for_each_drv+0x84/0xd0
[ 6.713119] __device_attach+0xe4/0x170
[ 6.713129] bus_probe_device+0x9f/0xb0
[ 6.713140] deferred_probe_work_func+0x79/0xd0
[ 6.713150] process_one_work+0x234/0x560
[ 6.713160] worker_thread+0x50/0x3b0
[ 6.713170] kthread+0x10a/0x140
[ 6.713180] ret_from_fork+0x3a/0x50
[ 6.713189]
other info that might help us debug this:
[ 6.713202] Possible unsafe locking scenario:
[ 6.713212] CPU0 CPU1
[ 6.713221] ---- ----
[ 6.713229] lock(iosf_mbi_punit_mutex);
[ 6.713239] lock(iosf_mbi_block_punit_i2c_access_count_mutex);
[ 6.713253] lock(iosf_mbi_punit_mutex);
[ 6.713265] lock(iosf_mbi_block_punit_i2c_access_count_mutex);
[ 6.713276]
*** DEADLOCK ***
In practice this can never happen because only the first caller, which
increments iosf_mbi_block_punit_i2c_access_count, will also take
iosf_mbi_punit_mutex; that is the whole purpose of the counter, which
itself is protected by iosf_mbi_block_punit_i2c_access_count_mutex.
But there is no way to tell the lockdep code about this and we really
want to be able to run a kernel with lockdep enabled without these
warnings being triggered.
2. The lockdep warning also points out another real problem: if 2 threads
are both in a block of code protected by iosf_mbi_block_punit_i2c_access
and the first thread to acquire the block exits before the second thread,
then the second thread will call mutex_unlock on iosf_mbi_punit_mutex.
But it is not the thread which took the mutex, and unlocking by another
thread is not allowed.
Fix this by getting rid of the notion of holding a mutex for the entire
duration of the PMIC accesses, be it either from the PUnit side, or from an
in kernel I2C driver. In general holding a mutex after exiting a function
is a bad idea and the above problems show this case is no different.
Instead, 2 counters are now used, one for PMIC accesses from the PUnit
and one for accesses from in-kernel I2C code. When access is requested,
the code now waits (using a waitqueue) for the counter of the other
type of access to reach 0; on release, if the counter reaches 0, the
waitqueue is woken.
Note that the counter approach is necessary to allow nested calls.
The main reason for this is so that a series of I2C transfers can be done
with the PUnit blocked from accessing the bus the whole time. This is
necessary to be able to safely read/modify/write a PMIC register without
racing with the PUnit doing the same thing.
Allowing nested iosf_mbi_block_punit_i2c_access() calls also is desirable
from a performance pov since the whole dance necessary to block the PUnit
from accessing the PMIC I2C bus is somewhat expensive.
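A stripped-down sketch of the new scheme (the real code also juggles the
PUnit semaphore, PM references, return values and error paths):

  static DECLARE_WAIT_QUEUE_HEAD(iosf_mbi_pmic_access_waitq);
  static DEFINE_MUTEX(iosf_mbi_pmic_access_mutex);
  static u32 iosf_mbi_pmic_punit_access_count;
  static u32 iosf_mbi_pmic_i2c_access_count;

  void iosf_mbi_block_punit_i2c_access(void)
  {
          mutex_lock(&iosf_mbi_pmic_access_mutex);
          while (iosf_mbi_pmic_punit_access_count != 0) {
                  mutex_unlock(&iosf_mbi_pmic_access_mutex);
                  wait_event(iosf_mbi_pmic_access_waitq,
                             iosf_mbi_pmic_punit_access_count == 0);
                  mutex_lock(&iosf_mbi_pmic_access_mutex);
          }
          iosf_mbi_pmic_i2c_access_count++;       /* nesting allowed */
          mutex_unlock(&iosf_mbi_pmic_access_mutex);
  }

  void iosf_mbi_unblock_punit_i2c_access(void)
  {
          mutex_lock(&iosf_mbi_pmic_access_mutex);
          if (--iosf_mbi_pmic_i2c_access_count == 0)
                  wake_up(&iosf_mbi_pmic_access_waitq);
          mutex_unlock(&iosf_mbi_pmic_access_mutex);
  }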
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lkml.kernel.org/r/20190812102113.95794-1-hdegoede@redhat.com
Pull networking fixes from David Miller:
1) Fix jmp to 1st instruction in x64 JIT, from Alexei Starovoitov.
2) Several kTLS fixes in mlx5 driver, from Tariq Toukan.
3) Fix severe performance regression due to lack of SKB coalescing of
fragments during local delivery, from Guillaume Nault.
4) Error path memory leak in sch_taprio, from Ivan Khoronzhuk.
5) Fix batched events in skbedit packet action, from Roman Mashak.
6) Propagate VLAN TX offload to hw_enc_features in bond and team
drivers, from Yue Haibing.
7) RXRPC local endpoint refcounting fix and read after free in
rxrpc_queue_local(), from David Howells.
8) Fix endian bug in ibmveth multicast list handling, from Thomas
Falcon.
9) Oops, make nlmsg_parse() wrap around the correct function,
__nlmsg_parse not __nla_parse(). Fix from David Ahern.
10) Memleak in sctp_scend_reset_streams(), from Zheng Bin.
11) Fix memory leak in cxgb4, from Wenwen Wang.
12) Yet another race in AF_PACKET, from Eric Dumazet.
13) Fix false detection of retransmit failures in tipc, from Tuong
Lien.
14) Use after free in ravb_tstamp_skb, from Tho Vu.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
ravb: Fix use-after-free ravb_tstamp_skb
netfilter: nf_tables: map basechain priority to hardware priority
net: sched: use major priority number as hardware priority
wimax/i2400m: fix a memory leak bug
net: cavium: fix driver name
ibmvnic: Unmap DMA address of TX descriptor buffers after use
bnxt_en: Fix to include flow direction in L2 key
bnxt_en: Use correct src_fid to determine direction of the flow
bnxt_en: Suppress HWRM errors for HWRM_NVM_GET_VARIABLE command
bnxt_en: Fix handling FRAG_ERR when NVM_INSTALL_UPDATE cmd fails
bnxt_en: Improve RX doorbell sequence.
bnxt_en: Fix VNIC clearing logic for 57500 chips.
net: kalmia: fix memory leaks
cx82310_eth: fix a memory leak bug
bnx2x: Fix VF's VLAN reconfiguration in reload.
Bluetooth: Add debug setting for changing minimum encryption key size
tipc: fix false detection of retransmit failures
lan78xx: Fix memory leaks
MAINTAINERS: r8169: Update path to the driver
MAINTAINERS: PHY LIBRARY: Update files in the record
...
The BIOS on the Samsung 500C Chromebook reports a very rudimentary E820 table
that consists of 2 entries:
BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] usable
BIOS-e820: [mem 0x00000000fffff000-0x00000000ffffffff] reserved
It breaks logic in find_trampoline_placement(): bios_start lands on the
end of the first 4k page and trampoline start gets placed below 0.
Detect the underflow and don't touch bios_start in such cases. This makes the
kernel ignore the E820 table on machines that don't have two usable pages
below BIOS_START_MAX.
Fixes: 1b3a626436 ("x86/boot/compressed/64: Validate trampoline placement against E820")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203463
Link: https://lkml.kernel.org/r/20190813131654.24378-1-kirill.shutemov@linux.intel.com
Some newer machines do not advertise legacy timers. The kernel can handle
that situation if the TSC and the CPU frequency are enumerated by CPUID or
MSRs and the CPU supports TSC deadline timer. If the CPU does not support
TSC deadline timer the local APIC timer frequency has to be known as well.
Some Ryzen machines do not advertise legacy timers, but there is no
reliable way to determine the bus frequency which feeds the local APIC
timer when the machine allows overclocking of that frequency.
As there is no legacy timer, the local APIC timer calibration crashes due to
a NULL pointer dereference when accessing the global clock event device,
which is not installed.
Switch the calibration loop to a non-interrupt-based one, which polls
either the TSC (if the frequency is known) or jiffies. The latter requires a
global clockevent. As the machines which do not have a global clockevent
installed have a known TSC frequency, this is a non-issue. For older machines
where the TSC frequency is not known, there is no known case where the legacy
timers do not exist, as that would have been reported long ago.
Reported-by: Daniel Drake <drake@endlessm.com>
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de
Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12