Pull x86 fixes from Thomas Gleixner:
"Another pile of small fixes and updates for x86:
- Plug a hole in the SMAP implementation which fails to clear AC on
NMI entry
- Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line
parameter works correctly again
- Use the proper accessor in the startup64 code for next_early_pgt to
prevent accessing invalid addresses and faulting in the early
boot code.
- Prevent CPU hotplug lock recursion in the MTRR code
- Unbreak CPU0 hotplugging
- Rename overly long CPUID bits which got introduced in this cycle
- Two commits which mark data 'const' and restrict the scope of data
and functions to file scope by making them 'static'"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Constify attribute_group structures
x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'
x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
x86: Fix norandmaps/ADDR_NO_RANDOMIZE
x86/mtrr: Prevent CPU hotplug lock recursion
x86: Mark various structures and functions as 'static'
x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
x86/smpboot: Unbreak CPU0 hotplug
x86/asm/64: Clear AC on NMI entries
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer. This is a partial revert of:
eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.
The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack. This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).
The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again. Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found (e.g. always loading PIE into
the mmap region for marked binaries).
Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast
Fixes: eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Fixes: 02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Kostya Serebryany <kcc@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't use shifts; tag the constants correctly as EPTP bits and use
better-matching names (PWL vs. GAW).
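For reference, the EPTP bit layout these constants describe can be
modelled in a few lines of C. This is only an illustrative sketch; the
macro names below are ours, not necessarily the ones the patch uses:

#include <stdint.h>
#include <stdio.h>

/* EPTP layout per the SDM: bits 2:0 memory type, bits 5:3 page-walk
 * length minus one, bit 6 enables accessed/dirty flags. */
#define EPTP_MT_WB      0x6ull          /* write-back memory type */
#define EPTP_PWL_4      (3ull << 3)     /* 4-level EPT page walk  */
#define EPTP_AD_ENABLE  (1ull << 6)     /* accessed/dirty bits    */

static uint64_t construct_eptp(uint64_t root_hpa)
{
        return root_hpa | EPTP_PWL_4 | EPTP_MT_WB;
}

int main(void)
{
        uint64_t root_hpa = 0x123456000ull;     /* hypothetical EPT root */

        printf("EPTP = %#llx\n", (unsigned long long)construct_eptp(root_hpa));
        return 0;
}

Building the EPTP value then becomes an OR of the root HPA with
self-describing flags instead of open-coded shifts.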
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.
Currently, emulator_read_write_onepage() makes use of the gpa_available
flag to avoid a guest page walk for known MMIO regions. Let's not limit
the gpa_available optimization to just MMIO regions: this patch extends
the check to avoid the page walk whenever the gpa_available flag is set.
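A minimal sketch of the resulting fast path, with userspace stand-ins
for the KVM structures (vcpu_gva_to_gpa() below is a hypothetical
placeholder for the real guest page walk):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gpa_t;

struct vcpu {
        bool  gpa_available;    /* set when the fault handler already knows the GPA */
        gpa_t gpa_val;          /* GPA taken from CR2/EXIT_QUALIFICATION             */
};

/* Hypothetical stand-in for the expensive guest page-table walk. */
static gpa_t vcpu_gva_to_gpa(struct vcpu *vcpu, uint64_t gva)
{
        (void)vcpu;
        return gva;             /* identity mapping, for illustration only */
}

static gpa_t emulator_translate(struct vcpu *vcpu, uint64_t gva)
{
        /* Fast path: reuse the GPA recorded at fault time instead of
         * walking the guest page tables again. */
        if (vcpu->gpa_available)
                return vcpu->gpa_val;

        return vcpu_gva_to_gpa(vcpu, gva);
}

int main(void)
{
        struct vcpu v = { .gpa_available = true, .gpa_val = 0xfee00000 };

        printf("translated to %#llx without a page walk\n",
               (unsigned long long)emulator_translate(&v, 0x7fff00001000ull));
        return 0;
}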
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.
This is done by duplicating the existing atomic_t refcount implementation,
normally adding a single instruction to detect whether the refcount
has gone negative (e.g. wrapped past INT_MAX or dropped below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.
Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:
http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
http://cyseclabs.com/page?n=02012016
This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.
On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).
On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.
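A rough C model of the check-and-saturate semantics described above (this
illustrates the behaviour only; it is not the actual x86 assembly
implementation, which does the check with a single conditional jump):

#include <limits.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_int refs; } refcount_t;

static void refcount_report(const char *op)
{
        fprintf(stderr, "refcount overflow detected in %s\n", op);
}

static void refcount_inc(refcount_t *r)
{
        int old = atomic_fetch_add(&r->refs, 1);

        /* Mirrors the "js" check: a wrap past INT_MAX, or an increment of
         * an already-negative (saturated) counter, lands in negative
         * territory. Saturate to INT_MIN / 2 in either case. */
        if (old == INT_MAX || old < 0) {
                atomic_store(&r->refs, INT_MIN / 2);
                if (old == INT_MAX)
                        refcount_report("refcount_inc");  /* first wrap only */
        }
}

int main(void)
{
        refcount_t r = { INT_MAX };

        refcount_inc(&r);       /* wraps negative -> saturates and reports */
        printf("value after overflow: %d\n", atomic_load(&r.refs));
        return 0;
}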
As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.
Example assembly comparison:
refcount_inc() before:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
refcount_inc() after:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3
...
.text.unlikely:
ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx
ffffffff816c36d7: 0f ff (bad)
These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
                            cycles  protections
atomic_t               82249267387  none
refcount_t-fast        82211446892  overflow, untested dec-to-zero
refcount_t-full       144814735193  overflow, untested dec-to-zero, inc-from-zero
This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement when repurposing this
code as a refcount-only protection.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull xen fixes from Juergen Gross:
"Some fixes for Xen:
- a fix for a regression introduced in 4.13 for a Xen HVM guest
configured with KASLR
- a fix for a possible deadlock in the xenbus driver when booting the
system
- a fix for lost interrupts in Xen guests"
* tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: Fix interrupt lost during irq_disable and irq_enable
xen: avoid deadlock in xenbus
xen: fix hvm guest with kaslr enabled
xen: split up xen_hvm_init_shared_info()
x86: provide an init_mem_mapping hypervisor hook
Provide a hook in hypervisor_x86 which is called after setting up the
initial memory mapping.
This is needed e.g. by Xen HVM guests to map the hypervisor shared
info page.
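The shape of the change is the usual optional-callback hook; a
stripped-down, compile-and-run illustration (the surrounding types are
stand-ins for the real kernel structures):

#include <stdio.h>

struct hypervisor_x86 {
        const char *name;
        void (*init_mem_mapping)(void);         /* optional, may be NULL */
};

static void xen_hvm_init_mem_mapping(void)
{
        puts("mapping hypervisor shared info page");
}

static const struct hypervisor_x86 x86_hyper_xen_hvm = {
        .name             = "Xen HVM",
        .init_mem_mapping = xen_hvm_init_mem_mapping,
};

static const struct hypervisor_x86 *x86_hyper = &x86_hyper_xen_hvm;

/* Called once the initial memory mapping has been set up. */
static void hypervisor_init_mem_mapping(void)
{
        if (x86_hyper && x86_hyper->init_mem_mapping)
                x86_hyper->init_mem_mapping();
}

int main(void)
{
        hypervisor_init_mem_mapping();
        return 0;
}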
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
This is the same as commit 147277540b ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.
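In code, the distinction reduces to testing bit 8 of the exit
qualification. A tiny illustration; the constant name below is ours and
may differ from the one used in the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit 8 of the EPT-violation exit qualification: set when the access was
 * to the final guest-physical address, clear when the violation happened
 * while the CPU was walking the guest's own page tables. */
#define EPT_VIOLATION_FINAL_GPA (1ull << 8)

static bool fault_on_final_gpa(uint64_t exit_qualification)
{
        return exit_qualification & EPT_VIOLATION_FINAL_GPA;
}

int main(void)
{
        uint64_t qual = 0x181;  /* hypothetical exit qualification value */

        printf("violation on final GPA: %s\n",
               fault_on_final_gpa(qual) ? "yes" : "no");
        return 0;
}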
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Apparently the binutils 2.20 assembler can't handle the '&&' operator in
the UNWIND_HINT_REGS macro. Rearrange the macro to do without it.
This fixes the following error:
arch/x86/entry/entry_64.S: Assembler messages:
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 39358a033b ("objtool, x86: Add facility for asm code to provide unwind hints")
Link: http://lkml.kernel.org/r/e2ad97c1ae49a484644b4aaa4dd3faa4d6d969b2.1502116651.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
get_cpl() requires vcpu_load(), so we must cache the result (i.e. whether
the vcpu was preempted while its CPL was 0) in kvm_vcpu_arch.
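Schematically, the value is captured at sched-out time, while the vcpu
state is still loaded, and later consumers read only the cached copy. A
toy model (all names and types below are stand-ins):

#include <stdbool.h>
#include <stdio.h>

struct vcpu_arch {
        bool preempted_in_kernel;       /* cached: was CPL 0 when preempted? */
};

struct vcpu {
        struct vcpu_arch arch;
        int cpl;                        /* stand-in; the real query needs vcpu_load */
};

/* Called on preemption, while the vcpu is still loaded: querying the CPL
 * is cheap here, so cache the result for later readers. */
static void kvm_sched_out(struct vcpu *vcpu)
{
        vcpu->arch.preempted_in_kernel = (vcpu->cpl == 0);
}

/* Later consumers only look at the cached value. */
static bool vcpu_preempted_in_kernel(struct vcpu *vcpu)
{
        return vcpu->arch.preempted_in_kernel;
}

int main(void)
{
        struct vcpu v = { .cpl = 0 };

        kvm_sched_out(&v);
        printf("preempted in kernel mode: %s\n",
               vcpu_preempted_in_kernel(&v) ? "yes" : "no");
        return 0;
}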
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 uses VMFUNC, L0 uses the associated vmexit to
emulate a switch of the EPT pointer by reloading the
guest MMU.
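Schematically, the emulation amounts to bounds-checking the requested
index into the vmcs12 EPTP list and reloading the MMU with the selected
root. A simplified stand-alone model (all structures and helpers below
are stand-ins, not the KVM code):

#include <stdint.h>
#include <stdio.h>

#define VMFUNC_EPTP_ENTRIES 512

struct nested_state {
        uint64_t eptp_list[VMFUNC_EPTP_ENTRIES];  /* from the vmcs12 EPTP list page */
        uint64_t current_eptp;
};

/* Stand-in for tearing down and reloading the guest MMU on the vmexit path. */
static void reload_guest_mmu(struct nested_state *ns, uint64_t new_eptp)
{
        ns->current_eptp = new_eptp;
}

/* L0's handling of an EPTP-switching VMFUNC (function number 0) from L2. */
static int handle_eptp_switch(struct nested_state *ns, uint64_t index)
{
        if (index >= VMFUNC_EPTP_ENTRIES)
                return -1;              /* real code would fault/vmexit here */

        reload_guest_mmu(ns, ns->eptp_list[index]);
        return 0;
}

int main(void)
{
        struct nested_state ns = { .eptp_list = { [1] = 0x234567000ull } };

        if (handle_eptp_switch(&ns, 1) == 0)
                printf("switched EPTP to %#llx\n",
                       (unsigned long long)ns.current_eptp);
        return 0;
}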
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enable VMFUNC in the secondary execution controls. This simplifies the
changes necessary to expose it to nested hypervisors. VMFUNCs still
cause #UD when invoked.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
We're about to amend the ACPI bus scan with DMI checks for whether we're
running on a Mac, to support Apple device properties in AML. The DMI
checks are performed for every single device, adding overhead for
everything x86 that isn't Apple, which is the majority. Rafael and Andy
therefore request that the DMI match be performed only once and the
result cached.
Outside of ACPI, various other Apple DMI checks exist, and it seems
reasonable to use the cached value there as well. Rafael, Andy and
Darren suggest performing the DMI check in arch code and making it
available with a header in include/linux/platform_data/x86/.
To this end, add early_platform_quirks() to arch/x86/kernel/quirks.c
to perform the DMI check and invoke it from setup_arch(). Switch over
all existing Apple DMI checks, thereby fixing two deficiencies:
* They are now #defined to false on non-x86 arches and can thus be
optimized away if they're located in cross-arch code.
* Some of them only match "Apple Inc." but not "Apple Computer, Inc.",
which is used by BIOSes released between January 2006 (when the first
x86 Macs started shipping) and January 2007 (when the company name
changed upon introduction of the iPhone).
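A minimal model of the caching pattern (the DMI lookup below is faked
with a constant string and the names are illustrative, not necessarily
the ones introduced by the patch):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Cached result, computed once early in boot and read everywhere else. */
static bool x86_apple_machine;

/* Stand-in for a DMI_SYS_VENDOR lookup. */
static const char *dmi_sys_vendor(void)
{
        return "Apple Computer, Inc.";  /* pre-2007 firmware string */
}

static void early_platform_quirks(void)
{
        const char *vendor = dmi_sys_vendor();

        /* Match both vendor strings Apple firmware has used. */
        x86_apple_machine = !strcmp(vendor, "Apple Inc.") ||
                            !strcmp(vendor, "Apple Computer, Inc.");
}

int main(void)
{
        early_platform_quirks();        /* called once from setup_arch() */
        printf("running on a Mac: %s\n", x86_apple_machine ? "yes" : "no");
        return 0;
}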
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Suggested-by: Darren Hart <dvhart@infradead.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support. The hardware uses RMIDs to
track the LLC occupancy of tasks, and these RMIDs are per package. This
makes it difficult to monitor a hierarchy such as a cgroup alongside
monitoring tasks separately, and several patches sent to LKML to fix
this were NACKed. Furthermore, the following issues in the current perf
cqm make it almost unusable:
1. No support to monitor the same group of tasks for which we do
allocation using resctrl.
2. It gives random and inaccurate data (mostly 0s) once we run out
of RMIDs due to issues in Recycling.
3. Recycling results in inaccuracy of data because we cannot
guarantee that the RMID was stolen from a task when it was not
pulling data into cache or even when it pulled the least data. Also
for monitoring llc_occupancy, if we stop using an RMID_x and then
start using an RMID_y after we reclaim an RMID from another event,
we miss accounting for all the occupancy that was tagged to RMID_x
at a later perf_count.
4. Recycling code makes the monitoring code, including scheduling,
complex because the event can lose its RMID at any time. Since MBM
counters count bandwidth over a period of time by taking snapshots of
total bytes at two different times, recycling complicates the way we
count MBM in a hierarchy. Also we need a spin lock while we do the
processing to account for MBM counter overflow. We also currently
use a spin lock in scheduling to prevent the RMID from being taken
away.
5. Lack of support when we run different kinds of events, like task,
system-wide and cgroup events, together. Data mostly prints 0s. This
is also because we can have only one RMID tied to a cpu as defined
by the cqm hardware, but perf can at the same time tie multiple
events during one sched_in.
6. No support for monitoring a group of tasks. There is partial support
for cgroups, but it does not work once there is a hierarchy of cgroups
or if we want to monitor a task in a cgroup and the cgroup itself.
7. No support for monitoring tasks for their lifetime without perf
overhead.
8. It reports the aggregate cache occupancy or memory bandwidth over
all sockets. But most cloud and VMM based use cases want to know the
individual per-socket usage.
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
We are using the same vector for nested and non-nested posted
interrupt delivery; this may cause interrupt latency in
L1 since we can't kick the L2 vcpu out of VMX non-root mode.
This patch introduces a new vector which is used only for nested
posted interrupts to solve the problem above.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.
It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.
For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debuginfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.
Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.
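Roughly, the lookup works like the following self-contained model of two
parallel arrays plus a block-granular fast lookup table (in the kernel
the tables are generated by objtool; the numbers below are made up):

#include <stdio.h>

/* Parallel arrays: orc_ip[i] is the start address covered by entry i. */
static const unsigned long orc_ip[]        = { 0x1000, 0x1040, 0x10c0, 0x1200 };
static const int           orc_sp_offset[] = { 8, 16, 24, 8 };
#define NUM_ORC (sizeof(orc_ip) / sizeof(orc_ip[0]))

/* Fast table: for each 256-byte block of text, the index of the first
 * ORC entry that may cover it, narrowing the binary-search range. */
#define TEXT_START 0x1000ul
#define BLOCK_SIZE 256ul
static const unsigned int fast_lookup[] = { 0, 2, 3, 3 };

static int orc_find(unsigned long ip)
{
        /* Assumes ip lies within the covered text range. */
        unsigned long block = (ip - TEXT_START) / BLOCK_SIZE;
        unsigned int lo = fast_lookup[block];
        unsigned int hi = block + 1 < sizeof(fast_lookup) / sizeof(fast_lookup[0])
                          ? fast_lookup[block + 1] : NUM_ORC - 1;

        /* Binary search for the last entry with orc_ip[] <= ip. */
        while (lo < hi) {
                unsigned int mid = (lo + hi + 1) / 2;

                if (orc_ip[mid] <= ip)
                        lo = mid;
                else
                        hi = mid - 1;
        }
        return lo;
}

int main(void)
{
        int i = orc_find(0x10f0);

        printf("ip 0x10f0 -> orc[%d], sp offset %d\n", i, orc_sp_offset[i]);
        return 0;
}

The fast table narrows the binary search to the handful of entries that
can cover the 256-byte block containing the instruction pointer, which
is where the speedup over a full-table search comes from.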
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
[ Extended the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PCID is a "process context ID" -- it's what other architectures call
an address space ID. Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB. x86's
PCID is 12 bits.
This is an unorthodox approach to using PCID. x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs. To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.
This patch uses PCID differently. We use a PCID to identify a
recently-used mm on a per-cpu basis. An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.
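A toy model of the per-cpu assignment policy described above, for a
single CPU (the pool size and field names are illustrative; the real
code also tracks a TLB generation counter per slot):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_DYN_ASIDS 6                  /* small per-cpu pool of PCIDs */

struct cpu_tlb_state {
        uint64_t ctx_id[NR_DYN_ASIDS];  /* which mm last used each slot */
        unsigned next_asid;             /* round-robin victim pointer   */
};

/* Pick a PCID slot for 'ctx' on this CPU. Returns the slot and whether a
 * TLB flush is needed (i.e. the slot was recycled from a different mm). */
static unsigned choose_asid(struct cpu_tlb_state *t, uint64_t ctx,
                            bool *need_flush)
{
        for (unsigned i = 0; i < NR_DYN_ASIDS; i++) {
                if (t->ctx_id[i] == ctx) {
                        *need_flush = false;    /* reuse recent slot as-is */
                        return i;
                }
        }

        /* Not recently used on this CPU: steal the next slot round-robin. */
        unsigned asid = t->next_asid;
        t->next_asid = (t->next_asid + 1) % NR_DYN_ASIDS;
        t->ctx_id[asid] = ctx;
        *need_flush = true;
        return asid;
}

int main(void)
{
        struct cpu_tlb_state t = { .next_asid = 0 };
        bool flush;

        unsigned a = choose_asid(&t, 42, &flush);
        printf("mm 42 -> asid %u, flush=%d\n", a, flush);
        a = choose_asid(&t, 42, &flush);        /* reuse, no flush */
        printf("mm 42 -> asid %u, flush=%d\n", a, flush);
        return 0;
}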
Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b9935, so all the
bookkeeping overhead is gone.
ping-pong between two mms on the same CPU using eventfd:
patched: 1.22µs
patched, nopcid: 1.33µs
unpatched: 1.34µs
Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration. dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:
patched: 1.8µs 11M dTLB misses
patched, nopcid: 6.2µs, 207M dTLB misses
unpatched: 6.1µs, 190M dTLB misses
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/9ee75f17a81770feed616358e6860d98a2a5b1e7.1500957502.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>