Commit Graph

7389 Commits

Author SHA1 Message Date
Linus Torvalds
7f680d7ec3 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "Another pile of small fixes and updates for x86:

   - Plug a hole in the SMAP implementation which misses to clear AC on
     NMI entry

   - Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line
     parameter works correctly again

   - Use the proper accessor in the startup64 code for next_early_pgt to
     prevent accessing of invalid addresses and faulting in the early
     boot code.

   - Prevent CPU hotplug lock recursion in the MTRR code

   - Unbreak CPU0 hotplugging

   - Rename overly long CPUID bits which got introduced in this cycle

   - Two commits which mark data 'const' and restrict the scope of data
     and functions to file scope by making them 'static'"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Constify attribute_group structures
  x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'
  x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
  x86: Fix norandmaps/ADDR_NO_RANDOMIZE
  x86/mtrr: Prevent CPU hotplug lock recursion
  x86: Mark various structures and functions as 'static'
  x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
  x86/smpboot: Unbreak CPU0 hotplug
  x86/asm/64: Clear AC on NMI entries
2017-08-20 09:36:52 -07:00
Kees Cook
c715b72c1b mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer.  This is a partial revert of:

  eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
  02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")

The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.

The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack.  This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).

The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again.  Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found.  (e.g.  always loading PIE into
the mmap region for marked binaries.)

Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast
Fixes: eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Fixes: 02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Kostya Serebryany <kcc@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:02 -07:00
David Hildenbrand
bb97a01693 KVM: VMX: cleanup EPTP definitions
Don't use shifts, tag them correctly as EPTP and use better matching
names (PWL vs. GAW).

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 17:38:01 +02:00
Brijesh Singh
618232e219 KVM: x86: Avoid guest page table walk when gpa_available is set
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.

Currently, emulator_read_write_onepage() makes use of gpa_available flag
to avoid a guest page walk for a known MMIO regions. Lets not limit
the gpa_available optimization to just MMIO region. The patch extends
the check to avoid page walk whenever gpa_available flag is set.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
 new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:49 +02:00
Ingo Molnar
0c23647913 Merge branch 'x86/asm' into locking/core
We need the ASM_UNREACHABLE() macro for a dependent patch.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-18 10:29:54 +02:00
Kees Cook
7a46ec0e2f locking/refcounts, x86/asm: Implement fast refcount overflow protection
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.

This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:

  http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

  http://cyseclabs.com/page?n=02012016

This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.

On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.

As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.

Example assembly comparison:

refcount_inc() before:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)

refcount_inc() after:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
  ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
  ...
  .text.unlikely:
  ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
  ffffffff816c36d7:       0f ff                   (bad)

These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

  2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
		    cycles		protections
  atomic_t           82249267387	none
  refcount_t-fast    82211446892	overflow, untested dec-to-zero
  refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero

This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement for repurposing this
code to be a refcount-only protection.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:40:26 +02:00
Tony Luck
ce0fa3e56a x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
Speculative processor accesses may reference any memory that has a
valid page table entry.  While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid.  We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Also see:

  http://marc.info/?l=linux-mm&m=149860136413338&w=2

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:30:49 +02:00
Ingo Molnar
927d2c21f2 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 09:41:41 +02:00
Greg Kroah-Hartman
d985524680 Merge 4.13-rc5 into char-misc-next
We want the firmware, and other changes, in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-14 13:29:31 -07:00
Vikas Shivappa
a9110b552d x86/intel_rdt: Modify the intel_pqr_state for better performance
Currently we have pqr_state and rdt_default_state which store the cached
CLOSID/RMIDs and the user configured cpu default values respectively. We
touch both of these during context switch. Put all of them in one
structure so that we can spare a cache line.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: sai.praneeth.prakhya@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1502304395-7166-3-git-send-email-vikas.shivappa@linux.intel.com
2017-08-14 11:47:47 +02:00
Linus Torvalds
043cd07c55 Merge tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
 "Some fixes for Xen:

   - a fix for a regression introduced in 4.13 for a Xen HVM-guest
     configured with KASLR

   - a fix for a possible deadlock in the xenbus driver when booting the
     system

   - a fix for lost interrupts in Xen guests"

* tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: Fix interrupt lost during irq_disable and irq_enable
  xen: avoid deadlock in xenbus
  xen: fix hvm guest with kaslr enabled
  xen: split up xen_hvm_init_shared_info()
  x86: provide an init_mem_mapping hypervisor hook
2017-08-12 09:01:36 -07:00
Juergen Gross
c138d81163 x86: provide an init_mem_mapping hypervisor hook
Provide a hook in hypervisor_x86 called after setting up initial
memory mapping.

This is needed e.g. by Xen HVM guests to map the hypervisor shared
info page.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11 15:50:21 +02:00
Borislav Petkov
5442c26995 x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
"virtual_vmload_vmsave" is what is going to land in /proc/cpuinfo now
as per v4.13-rc4, for a single feature bit which is clearly too long.

So rename it to what it is called in the processor manual.
"v_vmsave_vmload" is a bit shorter, after all.

We could go more aggressively here but having it the same as in the
processor manual is advantageous.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm-ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170801185552.GA3743@nazgul.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11 13:42:28 +02:00
Vitaly Kuznetsov
2ffd9e33ce x86/hyper-v: Use hypercall for remote TLB flush
Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
this is supposed to work faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory and we don't really want to
have memory allocation on each call so we pre-allocate per cpu memory areas
on boot.

pv_ops patching is happening very early so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too and there is
even a hint for that. However, I don't see a room for optimization on the
host side as both hypercall and native tlb flush will result in vmexit. The
hint is also not set on modern Hyper-V versions.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-8-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 20:16:44 +02:00
Andy Lutomirski
9584d98bed x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
In ELF_COPY_CORE_REGS, we're copying from the current task, so
accessing thread.fsbase and thread.gsbase makes no sense.  Just read
the values from the CPU registers.

In practice, the old code would have been correct most of the time
simply because thread.fsbase and thread.gsbase usually matched the
CPU registers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang Seok <chang.seok.bae@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 17:15:13 +02:00
Vitaly Kuznetsov
7415aea607 hyper-v: Globalize vp_index
To support implementing remote TLB flushing on Hyper-V with a hypercall
we need to make vp_index available outside of vmbus module. Rename and
globalize.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-7-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 16:50:23 +02:00
Vitaly Kuznetsov
806c89273b x86/hyper-v: Implement rep hypercalls
Rep hypercalls are normal hypercalls which perform multiple actions at
once. Hyper-V guarantees to return exectution to the caller in not more
than 50us and the caller needs to use hypercall continuation. Touch NMI
watchdog between hypercall invocations.

This is going to be used for HvFlushVirtualAddressList hypercall for
remote TLB flushing.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-6-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov
6a8edbd0c5 x86/hyper-v: Introduce fast hypercall implementation
Hyper-V supports 'fast' hypercalls when all parameters are passed through
registers. Implement an inline version of a simpliest of these calls:
hypercall with one 8-byte input and no output.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-4-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov
fc53662f13 x86/hyper-v: Make hv_do_hypercall() inline
We have only three call sites for hv_do_hypercall() and we're going to
change HVCALL_SIGNAL_EVENT to doing fast hypercall so we can inline this
function for optimization.

Hyper-V top level functional specification states that r9-r11 registers
and flags may be clobbered by the hypervisor during hypercall and with
inlining this is somewhat important, add the clobbers.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-3-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov
79cadff2d9 x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set
Code is arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set, the
'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c
which is included when CONFIG_HYPERVISOR_GUEST is set.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170802160921.21791-2-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 16:50:22 +02:00
Paolo Bonzini
eebed24389 kvm: nVMX: Add support for fast unprotection of nested guest page tables
This is the same as commit 147277540b ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:44:04 +02:00
Josh Poimboeuf
af79ded44b x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
Apparently the binutils 2.20 assembler can't handle the '&&' operator in
the UNWIND_HINT_REGS macro.  Rearrange the macro to do without it.

This fixes the following error:

  arch/x86/entry/entry_64.S: Assembler messages:
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement
  arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 39358a033b ("objtool, x86: Add facility for asm code to provide unwind hints")
Link: http://lkml.kernel.org/r/e2ad97c1ae49a484644b4aaa4dd3faa4d6d969b2.1502116651.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 14:16:19 +02:00
Andy Lutomirski
603e492e86 x86/asm/32: Fix regs_get_register() on segment registers
The segment register high words on x86_32 may contain garbage.
Teach regs_get_register() to read them as u16 instead of unsigned
long.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0b76f6dbe477b7b1a81938fddcc3c483d48f0ff2.1502314765.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 13:14:58 +02:00
Ingo Molnar
1d0f49e140 Merge branch 'x86/urgent' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 13:14:15 +02:00
Ingo Molnar
388f8e1273 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:20:53 +02:00
Ingo Molnar
fc33a8943e Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:10:19 +02:00
Janakarajan Natarajan
910448bbed perf/x86/amd/uncore: Rename cpufeatures macro for cache counters
In Family 17h, L3 is the last level cache as opposed to L2 in previous
families. Avoid this name confusion and rename X86_FEATURE_PERFCTR_L2 to
X86_FEATURE_PERFCTR_LLC to indicate the performance counter on the last
level of cache.

Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/016311029fdecdc3fdc13b7ed865c6cbf48b2f15.1497452002.git.Janakarajan.Natarajan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:08:38 +02:00
Longpeng(Mike)
de63ad4cf4 KVM: X86: implement the logic for spinlock optimization
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Bandan Das
41ab937274 KVM: nVMX: Emulate EPTP switching for the L1 hypervisor
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:22 +02:00
Bandan Das
2a499e49c2 KVM: vmx: Enable VMFUNCs
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:20 +02:00
Lukas Wunner
630b3aff8a treewide: Consolidate Apple DMI checks
We're about to amend ACPI bus scan with DMI checks whether we're running
on a Mac to support Apple device properties in AML.  The DMI checks are
performed for every single device, adding overhead for everything x86
that isn't Apple, which is the majority.  Rafael and Andy therefore
request to perform the DMI match only once and cache the result.

Outside of ACPI various other Apple DMI checks exist and it seems
reasonable to use the cached value there as well.  Rafael, Andy and
Darren suggest performing the DMI check in arch code and making it
available with a header in include/linux/platform_data/x86/.

To this end, add early_platform_quirks() to arch/x86/kernel/quirks.c
to perform the DMI check and invoke it from setup_arch().  Switch over
all existing Apple DMI checks, thereby fixing two deficiencies:

* They are now #defined to false on non-x86 arches and can thus be
  optimized away if they're located in cross-arch code.

* Some of them only match "Apple Inc." but not "Apple Computer, Inc.",
  which is used by BIOSes released between January 2006 (when the first
  x86 Macs started shipping) and January 2007 (when the company name
  changed upon introduction of the iPhone).

Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Suggested-by: Darren Hart <dvhart@infradead.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-03 23:26:22 +02:00
Vikas Shivappa
748b6b881c x86/intel_rdt/cqm: Add sched_in support
OS associates an RMID/CLOSid to a task by writing the per CPU
IA32_PQR_ASSOC MSR when a task is scheduled in.

The sched_in code will stay as no-op unless we are running on Intel SKU
which supports either resource control or monitoring and we also enable
them by mounting the resctrl fs.  The per cpu CLOSid/RMID values are
cached and the write is performed only when a task with a different
CLOSid/RMID is scheduled in.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-25-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:28 +02:00
Vikas Shivappa
4be6c07842 x86/intel_rdt: Introduce rdt_enable_key for scheduling
Introduce the usage of rdt_enable_key in sched_in code as a preparation
to add RDT monitoring support for sched_in.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-24-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:27 +02:00
Vikas Shivappa
b09d981b3f x86/intel_rdt: Prepare to add RDT monitor cpus file support
Separate the ctrl cpus file handling from the generic cpus file handling
and convert the per cpu closid from u32 to a struct which will be used
later to add rmid to the same struct. Also cleanup some name space.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-17-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:25 +02:00
Vikas Shivappa
0734ded1ab x86/intel_rdt: Change closid type from int to u32
OS associates a CLOSid(Class of service id) to a task by writing the
high 32 bits of per CPU IA32_PQR_ASSOC MSR when a task is scheduled in.
CPUID.(EAX=10H, ECX=1):EDX[15:0] enumerates the max CLOSID supported and
it is zero indexed. Hence change the type to u32 from int.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-15-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:24 +02:00
Vikas Shivappa
1b5c0b7583 x86/intel_rdt: Cleanup namespace to support RDT monitoring
Few of the data-structures have generic names although they are RDT
allocation specific. Rename them to be allocation specific to
accommodate RDT monitoring. E.g. s/enabled/alloc_enabled/

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-7-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:20 +02:00
Vikas Shivappa
0583020456 x86/intel_rdt: Change file names to accommodate RDT monitor code
Because the "perf cqm" and resctrl code were separately added and
indivdually configurable, there seem to be separate context switch code
and also things on global .h which are not really needed.

Move only the scheduling specific code and definitions to
<asm/intel_rdt_sched.h> and the put all the other declarations to a
local intel_rdt.h.

h/t to Reinette Chatre for pointing out that we should separate the
public interfaces used by other parts of the kernel from private
objects shared between the various files comprising RDT.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-5-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:19 +02:00
Vikas Shivappa
f01d7d51f5 x86/intel_rdt: Introduce a common compile option for RDT
We currently have a CONFIG_RDT_A which is for RDT(Resource directory
technology) allocation based resctrl filesystem interface. As a
preparation to add support for RDT monitoring as well into the same
resctrl filesystem, change the config option to be CONFIG_RDT which
would include both RDT allocation and monitoring code.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:19 +02:00
Vikas Shivappa
c39a0e2c88 x86/perf/cqm: Wipe out perf based cqm
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support.  The hardware uses RMIDs to
track the llc occupancy of tasks and these RMIDs are per package. This
makes monitoring a hierarchy like cgroup along with monitoring of tasks
separately difficult and several patches sent to lkml to fix them were
NACKed. Further more, the following issues in the current perf cqm make
it almost unusable:

    1. No support to monitor the same group of tasks for which we do
    allocation using resctrl.

    2. It gives random and inaccurate data (mostly 0s) once we run out
    of RMIDs due to issues in Recycling.

    3. Recycling results in inaccuracy of data because we cannot
    guarantee that the RMID was stolen from a task when it was not
    pulling data into cache or even when it pulled the least data. Also
    for monitoring llc_occupancy, if we stop using an RMID_x and then
    start using an RMID_y after we reclaim an RMID from an other event,
    we miss accounting all the occupancy that was tagged to RMID_x at a
    later perf_count.

    2. Recycling code makes the monitoring code complex including
    scheduling because the event can lose RMID any time. Since MBM
    counters count bandwidth for a period of time by taking snap shot of
    total bytes at two different times, recycling complicates the way we
    count MBM in a hierarchy. Also we need a spin lock while we do the
    processing to account for MBM counter overflow. We also currently
    use a spin lock in scheduling to prevent the RMID from being taken
    away.

    4. Lack of support when we run different kind of event like task,
    system-wide and cgroup events together. Data mostly prints 0s. This
    is also because we can have only one RMID tied to a cpu as defined
    by the cqm hardware but a perf can at the same time tie multiple
    events during one sched_in.

    5. No support of monitoring a group of tasks. There is partial support
    for cgroup but it does not work once there is a hierarchy of cgroups
    or if we want to monitor a task in a cgroup and the cgroup itself.

    6. No support for monitoring tasks for the lifetime without perf
    overhead.

    7. It reported the aggregate cache occupancy or memory bandwidth over
    all sockets. But most cloud and VMM based use cases want to know the
    individual per-socket usage.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
2017-08-01 22:41:18 +02:00
Tom Lendacky
57bd1905b2 acpi, x86/mm: Remove encryption mask from ACPI page protection type
The arch_apei_get_mem_attributes() function is used to set the page
protection type for ACPI physical addresses. When SME is active, the
associated protection type cannot have the encryption mask set since the
ACPI tables live in un-encrypted memory - the kernel will see corrupted
data.

To fix this, create a new protection type, PAGE_KERNEL_NOENC, that is a
'no encryption' version of PAGE_KERNEL, and return that from
arch_apei_get_mem_attributes().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e1cb9395b2f061cd96f1e59c3cbbe5ff5d4ec26e.1501186516.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-30 12:09:12 +02:00
Tom Lendacky
4e237903f9 x86/mm, kexec: Fix memory corruption with SME on successive kexecs
After issuing successive kexecs it was found that the SHA hash failed
verification when booting the kexec'd kernel.  When SME is enabled, the
change from using pages that were marked encrypted to now being marked as
not encrypted (through new identify mapped page tables) results in memory
corruption if there are any cache entries for the previously encrypted
pages. This is because separate cache entries can exist for the same
physical location but tagged both with and without the encryption bit.

To prevent this, issue a wbinvd if SME is active before copying the pages
from the source location to the destination location to clear any possible
cache entry conflicts.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <kexec@lists.infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e7fb8610af3a93e8f8ae6f214cd9249adc0df2b4.1501186516.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-30 12:09:12 +02:00
Andy Lutomirski
99504819fc x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
Now that pt_regs properly defines segment fields as 16-bit on 32-bit
CPUs, there's no need to mask off the high word.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-30 12:04:41 +02:00
Andy Lutomirski
385eca8f27 x86/asm/32: Make pt_regs's segment registers be 16 bits
Many 32-bit x86 CPUs do 16-bit writes when storing segment registers to
memory.  This can cause the high word of regs->[cdefgs]s to
occasionally contain garbage.

Rather than making the entry code more complicated to fix up the
garbage, just change pt_regs to reflect reality.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-30 12:04:40 +02:00
Dou Liyang
dbe04493ed x86/topology: Remove the unused parent_node() macro
Commit:

  a7be6e5a7f ("mm: drop useless local parameters of __register_one_node()")

... removed the last user of parent_node(), so remove the macro.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1501076076-1974-11-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-27 10:53:00 +02:00
Wincy Van
210f84b0ca x86: irq: Define a global vector for nested posted interrupts
We are using the same vector for nested/non-nested posted
interrupts delivery, this may cause interrupts latency in
L1 since we can't kick the L2 vcpu out of vmx-nonroot mode.

This patch introduces a new vector which is only for nested
posted interrupts to solve the problems above.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-26 18:57:45 +02:00
Josh Poimboeuf
81d3871900 x86/kconfig: Consolidate unwinders into multiple choice selection
There are three mutually exclusive unwinders.  Make that more obvious by
combining them into a multiple-choice selection:

  CONFIG_FRAME_POINTER_UNWINDER
  CONFIG_ORC_UNWINDER
  CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

Frame pointers are still the default (for now).

The old CONFIG_FRAME_POINTER option is still used in some
arch-independent places, so keep it around, but make it
invisible to the user on x86 - it's now selected by
CONFIG_FRAME_POINTER_UNWINDER=y.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-26 14:05:36 +02:00
Josh Poimboeuf
ee9f8fce99 x86/unwind: Add the ORC unwinder
Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debugninfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com
[ Extended the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-26 13:18:20 +02:00
Ingo Molnar
25547b6b08 Merge branch 'WIP.locking/atomics' into locking/core
Merge two uncontroversial cleanups from this branch while the rest is being reworked.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-25 15:48:58 +02:00
Kees Cook
df3405245a x86/asm: Add suffix macro for GEN_*_RMWcc()
The coming x86 refcount protection needs to be able to add trailing
instructions to the GEN_*_RMWcc() operations. This extracts the
difference between the goto/non-goto cases so the helper macros
can be defined outside the #ifdef cases. Additionally adds argument
naming to the resulting asm for referencing from suffixed
instructions, and adds clobbers for "cc", and "cx" to let suffixes
use _ASM_CX, and retain any set flags.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1500921349-10803-2-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-25 11:18:09 +02:00
Andy Lutomirski
10af6235e0 x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs).  I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers.  Unpatched means commit a4eb8b9935, so all the
bookkeeping overhead is gone.

ping-pong between two mms on the same CPU using eventfd:

  patched:         1.22µs
  patched, nopcid: 1.33µs
  unpatched:       1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration.  dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:

  patched:         1.8µs  11M  dTLB misses
  patched, nopcid: 6.2µs, 207M dTLB misses
  unpatched:       6.1µs, 190M dTLB misses

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/9ee75f17a81770feed616358e6860d98a2a5b1e7.1500957502.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-25 11:16:12 +02:00