Commit Graph

12766 Commits

Author SHA1 Message Date
Linus Torvalds
a38ecbbd0b Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "A CR4-shadow 32-bit init fix, plus two typo fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Init per-cpu shadow copy of CR4 on 32-bit CPUs too
  x86/platform/intel-mid: Fix trivial printk message typo in intel_mid_arch_setup()
  x86/cpu/intel: Fix trivial typo in intel_tlb_table[]
2015-03-01 12:22:44 -08:00
Linus Torvalds
d7b48fec35 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Two kprobes fixes and a handful of tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Make sparc64 arch point to sparc
  perf symbols: Define EM_AARCH64 for older OSes
  perf top: Fix SIGBUS on sparc64
  perf tools: Fix probing for PERF_FLAG_FD_CLOEXEC flag
  perf tools: Fix pthread_attr_setaffinity_np build error
  perf tools: Define _GNU_SOURCE on pthread_attr_setaffinity_np feature check
  perf bench: Fix order of arguments to memcpy_alloc_mem
  kprobes/x86: Check for invalid ftrace location in __recover_probed_insn()
  kprobes/x86: Use 5-byte NOP when the code might be modified by ftrace
2015-03-01 11:56:13 -08:00
Steven Rostedt
5b2bdbc845 x86: Init per-cpu shadow copy of CR4 on 32-bit CPUs too
Commit:

   1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")

added a shadow CR4 such that reads and writes that do not
modify the CR4 execute much faster than always reading the
register itself.

The change modified cpu_init() in common.c, so that the
shadow CR4 gets initialized before anything uses it.

Unfortunately, there are two cpu_init()s in common.c: one for
64-bit and one for 32-bit. The commit only added the shadow
init to the 64-bit path, but the 32-bit path needs the init
too.
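
For reference, the shadow machinery introduced by 1e02ce4ccc looks
roughly like this (a simplified sketch, not the verbatim kernel code):

  /* Simplified sketch of the per-cpu CR4 shadow helpers. */
  static inline void cr4_init_shadow(void)
  {
          this_cpu_write(cpu_tlbstate.cr4, __read_cr4());
  }

  static inline unsigned long cr4_read_shadow(void)
  {
          /* Fast path: read the per-cpu copy, not the CR4 register. */
          return this_cpu_read(cpu_tlbstate.cr4);
  }

The fix boils down to giving the 32-bit cpu_init() the same
shadow-init call that the 64-bit path already performs.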

Link: http://lkml.kernel.org/r/20150227125208.71c36402@gandalf.local.home
Fixes: 1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150227145019.2bdd4354@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-28 08:04:20 +01:00
Ingo Molnar
5838d18955 Merge branch 'linus' into x86/urgent, to merge dependent patch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-28 08:03:10 +01:00
Wang Nan
b4d8327024 x86/traps: Enable DEBUG_STACK after cpu_init() for TRAP_DB/BP
Before this patch, early_trap_init() installs DEBUG_STACK for
X86_TRAP_BP and X86_TRAP_DB. However, DEBUG_STACK doesn't work
correctly until cpu_init() has run, which happens from trap_init().

This patch passes 0 to set_intr_gate_ist() and
set_system_intr_gate_ist() instead of DEBUG_STACK, so that these
traps use the same stack as the kernel, and then installs
DEBUG_STACK for them in trap_init().

As the core runs at ring 0 between early_trap_init() and
trap_init(), there is no chance to get a bad stack before
trap_init().

As NMI is also enabled in trap_init(), we don't need to care
about is_debug_stack() and related things used in
arch/x86/kernel/nmi.c.
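
Concretely, the gate setup changes along these lines (a sketch based on
the description above, not a verbatim diff):

  /* early_trap_init(): IST index 0 == use the regular kernel stack. */
  set_intr_gate_ist(X86_TRAP_DB, &debug, 0);
  set_system_intr_gate_ist(X86_TRAP_BP, &int3, 0);

  /*
   * trap_init(): the TSS/IST stacks are usable once cpu_init() has
   * run, so switch #DB and #BP over to DEBUG_STACK here.
   */
  set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
  set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);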

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <dave.hansen@linux.intel.com>
Cc: <lizefan@huawei.com>
Cc: <luto@amacapital.net>
Cc: <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1424929779-13174-1-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-26 12:29:20 +01:00
Ingo Molnar
e9e4e44309 Merge tag 'v4.0-rc1' into perf/core, to refresh the tree
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-26 12:24:50 +01:00
Matt Fleming
59bf7fd45c perf/x86/intel: Enable conflicting event scheduling for CQM
We can leverage the workqueue that we use for RMID rotation to support
scheduling of conflicting monitoring events. Allowing events that
monitor conflicting things is done at various other places in the perf
subsystem, so there's precedent there.

An example of two conflicting events would be monitoring a cgroup and
simultaneously monitoring a task within that cgroup.

This uses the cache_groups list as a queuing mechanism, where every
event that reaches the front of the list gets the chance to be scheduled
in, possibly descheduling any conflicting events that are running.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/1422038748-21397-10-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:36 +01:00
Matt Fleming
bff671dba7 perf/x86/intel: Perform rotation on Intel CQM RMIDs
There are many use cases where people will want to monitor more tasks
than there exist RMIDs in the hardware, meaning that we have to perform
some kind of multiplexing.

We do this by "rotating" the RMIDs in a workqueue, and assigning an RMID
to a waiting event when the RMID becomes unused.

This scheme reserves one RMID at all times for rotation. When we need to
schedule a new event we give it the reserved RMID, pick a victim event
from the front of the global CQM list and wait for the victim's RMID to
drop to zero occupancy, before it becomes the new reserved RMID.

We put the victim's RMID onto the limbo list, where it resides for a
"minimum queue time", which is intended to save ourselves an expensive
smp IPI when the RMID is unlikely to have an occupancy value below
__intel_cqm_threshold.

If we fail to recycle an RMID even after waiting the minimum queue time,
then we need to increment __intel_cqm_threshold. There is an upper bound
on this threshold, __intel_cqm_max_threshold, which is programmable from
userland as /sys/devices/intel_cqm/max_recycling_threshold.

The comments above __intel_cqm_rmid_rotate() have more details.
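
In pseudocode, one rotation pass as described above looks roughly like
this (a condensed sketch; the helper names are illustrative, not the
actual driver functions):

  static void cqm_rotate_once(void)
  {
          struct perf_event *victim = front_of_cache_groups_list();
          u32 old_rmid;

          /* Give the waiting event the reserved RMID, steal its old one. */
          old_rmid = swap_in_reserved_rmid(victim);

          /* Park the old RMID on the limbo list for at least the minimum
           * queue time before probing its occupancy with an IPI.
           */
          add_to_limbo(old_rmid);

          /* If nothing could be recycled even after the minimum queue
           * time, trade accuracy for progress by raising the threshold.
           */
          if (!recycle_limbo_rmids(__intel_cqm_threshold) &&
              __intel_cqm_threshold < __intel_cqm_max_threshold)
                  __intel_cqm_threshold++;
  }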

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/1422038748-21397-9-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:35 +01:00
Matt Fleming
bfe1fcd268 perf/x86/intel: Support task events with Intel CQM
Add support for task events as well as system-wide events. This change
has a big impact on the way that we gather LLC occupancy values in
intel_cqm_event_read().

Currently, for system-wide (per-cpu) events we defer processing to
userspace which knows how to discard all but one cpu result per package.

Things aren't so simple for task events because we need to do the value
aggregation ourselves. To do this, we defer updating the LLC occupancy
value in event->count from intel_cqm_event_read() and do an SMP
cross-call to read values for all packages in intel_cqm_event_count().
We need to ensure that we only do this for one task event per cache
group, otherwise we'll report duplicate values.

If we're a system-wide event we want to fall back to the default
perf_event_count() implementation. Refactor this into a common function
so that we don't duplicate the code.

Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
driver.  This task information is only available in the upper layers of
the perf infrastructure.

Other perf backends stash the target task in event->hw.*target so we
need to do something similar. The task is used to determine whether
events should share a cache group and an RMID.
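
The resulting read path looks roughly like this (simplified sketch;
helper names such as cqm_group_leader() are illustrative):

  static u64 intel_cqm_event_count(struct perf_event *event)
  {
          struct rmid_read rr = { .value = ATOMIC64_INIT(0) };

          /* System-wide events keep the default behaviour. */
          if (event->cpu != -1)
                  return __perf_event_count(event);

          /* Only one event per cache group reports a value. */
          if (!cqm_group_leader(event))
                  return 0;

          rr.rmid = event->hw.cqm_rmid;

          /* Read the RMID's LLC occupancy on one CPU of every package. */
          on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);

          local64_set(&event->count, atomic64_read(&rr.value));
          return __perf_event_count(event);
  }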

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:34 +01:00
Matt Fleming
35298e554c perf/x86/intel: Implement LRU monitoring ID allocation for CQM
It's possible to run into issues with re-using unused monitoring IDs
because there may be stale cachelines associated with that ID from a
previous allocation. This can cause the LLC occupancy values to be
inaccurate.

To attempt to mitigate this problem we place the IDs on a least recently
used list, essentially a FIFO. The basic idea is that the longer the
time period between ID re-use the lower the probability that stale
cachelines exist in the cache.
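
The allocator itself is essentially a list used as a FIFO (a minimal
sketch of the idea; INVALID_RMID and __rmid_entry() are stand-ins):

  static LIST_HEAD(cqm_rmid_free_lru);

  struct cqm_rmid_entry {
          u32 rmid;
          struct list_head list;
  };

  static u32 __get_rmid(void)
  {
          struct cqm_rmid_entry *entry;

          if (list_empty(&cqm_rmid_free_lru))
                  return INVALID_RMID;

          /* Oldest free RMID first: its cachelines are most likely gone. */
          entry = list_first_entry(&cqm_rmid_free_lru,
                                   struct cqm_rmid_entry, list);
          list_del(&entry->list);
          return entry->rmid;
  }

  static void __put_rmid(u32 rmid)
  {
          /* Freed RMIDs go to the back of the queue. */
          list_add_tail(&__rmid_entry(rmid)->list, &cqm_rmid_free_lru);
  }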

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/1422038748-21397-7-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:33 +01:00
Matt Fleming
4afbb24ce5 perf/x86/intel: Add Intel Cache QoS Monitoring support
Future Intel Xeon processors support a Cache QoS Monitoring feature that
allows tracking of the LLC occupancy for a task or task group, i.e. the
amount of data pulled into the LLC for the task (group).

Currently the PMU only supports per-cpu events. We create an event for
each cpu and read out all the LLC occupancy values.

Because this results in duplicate values being written out to userspace,
we also export a .per-pkg event file so that the perf tools only
accumulate values for one cpu per package.

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/1422038748-21397-6-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:32 +01:00
Peter P Waskiewicz Jr
cbc82b1726 x86: Add support for Intel Cache QoS Monitoring (CQM) detection
This patch adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors.  It adds the
new values for tracking CQM resources to the cpuinfo_x86 structure,
plus the CPUID detection routines for CQM.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group.  Using this data
from the CPU, software can be written to extract this data and
report cache usage and occupancy for a particular process, or
group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.
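
Detection is driven by CPUID leaf 0xF; roughly (a sketch of the
enumeration flow, field semantics per the SDM section above):

  u32 eax, ebx, ecx, edx;

  /* Sub-leaf 0: which resource types can be monitored at all. */
  cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
  if (edx & BIT(1)) {                     /* L3 cache monitoring */
          c->x86_cache_max_rmid  = ebx;   /* maximum RMID range */

          /* Sub-leaf 1: L3-specific capabilities. */
          cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
          c->x86_cache_max_rmid  = ecx;   /* max RMID for L3 monitoring */
          c->x86_cache_occ_scale = ebx;   /* occupancy upscaling factor */
  }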

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Webb <chris@arachsys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Honeyman <stevenhoneyman@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/1422038748-21397-5-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 13:53:31 +01:00
Andy Lutomirski
72c6fb4f74 x86/ia32-compat: Fix CLONE_SETTLS bitness of copy_thread()
CLONE_SETTLS is expected to write a TLS entry in the GDT for
32-bit callers and to set FSBASE for 64-bit callers.

The correct check is is_ia32_task(), which returns true in the
context of a 32-bit syscall.  TIF_IA32 is set if the task itself
has a 32-bit personality, which is not the same thing.
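
The fix boils down to switching the predicate in the 64-bit
copy_thread() (sketch of the relevant branch; surrounding plumbing
elided):

  if (clone_flags & CLONE_SETTLS) {
  #ifdef CONFIG_IA32_EMULATION
          if (is_ia32_task())     /* was: test_thread_flag(TIF_IA32) */
                  /* 32-bit syscall: install a TLS entry in the GDT. */
                  err = do_set_thread_area(p, -1,
                          (struct user_desc __user *)childregs->si, 0);
          else
  #endif
                  /* 64-bit syscall: CLONE_SETTLS sets FSBASE. */
                  err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
  }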

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Link: http://lkml.kernel.org/r/45e2d0d695393d76406a0c7225b82c76223e0cc5.1424822291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 08:27:50 +01:00
Andy Lutomirski
08571f1ae3 x86/ptrace: Remove checks for TIF_IA32 when changing CS and SS
The ability for modified CS and/or SS to be useful has nothing
to do with TIF_IA32.  Similarly, if there's an exploit involving
changing CS or SS, it's exploitable with or without a TIF_IA32
check.

So just delete the check.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Link: http://lkml.kernel.org/r/71c7ab36456855d11ae07edd4945a7dfe80f9915.1424822291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-25 08:27:49 +01:00
Juergen Gross
8d4a40bc06 x86/mm: Use early_memunmap() instead of early_iounmap()
Memory mapped via early_memremap() should be unmapped with
early_memunmap() instead of early_iounmap().
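
I.e. the mapping and unmapping calls should pair up (trivial sketch):

  void *p = early_memremap(phys_addr, size);
  if (p) {
          /* ... use the mapping ... */
          early_memunmap(p, size);        /* not early_iounmap() */
  }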

Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: matt.fleming@intel.com
Link: http://lkml.kernel.org/r/1424769211-11378-2-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-24 15:58:06 +01:00
Adrien Schildknecht
04769ae3ac x86/kernel: Use kstack_end() in dumpstack_64.c
i386 is already using kstack_end() in dumpstack_32.c. We should also
use it to make the code clearer and unify the stack printing logic some
more.
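
For reference, the stack walk then reads along these lines (sketch;
print_address() is a stand-in for the actual printing helper):

  while (!kstack_end(stack)) {
          unsigned long addr = *stack++;

          if (__kernel_text_address(addr))
                  print_address(addr);
  }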

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Adrien Schildknecht <adrien+dev@schischi.me>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1424618638-6375-1-git-send-email-adrien+dev@schischi.me
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 18:37:13 +01:00
Adrien Schildknecht
1fc7f61c3e x86/kernel: Fix output of show_stack_log_lvl()
show_stack_log_lvl() does not set the log level after a new line, so the
following messages printed with pr_cont() are assigned the default log
level.

This patch prepends the log level to the next message following a new
line.

print_trace_address() uses printk(log_lvl). A printk() call with just
a log level and no text is ignored and thus has no effect on the next
pr_cont(). We need to prepend the log level directly to the message.

Signed-off-by: Adrien Schildknecht <adrien+dev@schischi.me>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1424399661-20327-1-git-send-email-adrien+dev@schischi.me
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 18:34:42 +01:00
David Vrabel
fdfd811ddd x86/xen: allow privcmd hypercalls to be preempted
Hypercalls submitted by user space tools via the privcmd driver can
take a long time (potentially many 10s of seconds) if the hypercall
has many sub-operations.

A fully preemptible kernel may deschedule such a task in any upcall
called from a hypercall continuation.

However, in a kernel with voluntary or no preemption, hypercall
continuations in Xen allow event handlers to be run but the task
issuing the hypercall will not be descheduled until the hypercall is
complete and the ioctl returns to user space.  These long running
tasks may also trigger the kernel's soft lockup detection.

Add xen_preemptible_hcall_begin() and xen_preemptible_hcall_end() to
bracket hypercalls that may be preempted.  Use these in the privcmd
driver.

When returning from an upcall, call xen_maybe_preempt_hcall(), which
adds a schedule point if the current task was within a preemptible
hypercall.

Since _cond_resched() can move the task to a different CPU, clear and
set xen_in_preemptible_hcall around the call.
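
A sketch of the mechanism (simplified; the flag is per-cpu and only
consulted on the upcall return path):

  static DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);

  static inline void xen_preemptible_hcall_begin(void)
  {
          __this_cpu_write(xen_in_preemptible_hcall, true);
  }

  static inline void xen_preemptible_hcall_end(void)
  {
          __this_cpu_write(xen_in_preemptible_hcall, false);
  }

  void xen_maybe_preempt_hcall(void)
  {
          if (unlikely(__this_cpu_read(xen_in_preemptible_hcall) &&
                       need_resched())) {
                  /* _cond_resched() may move us to another CPU, so clear
                   * the per-cpu flag around the schedule point.
                   */
                  __this_cpu_write(xen_in_preemptible_hcall, false);
                  _cond_resched();
                  __this_cpu_write(xen_in_preemptible_hcall, true);
          }
  }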

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2015-02-23 16:30:24 +00:00
Oleg Nesterov
110d7f7513 x86/fpu: Don't abuse FPU in kernel threads if use_eager_fpu()
AFAICS, there is no reason why kernel threads should have FPU context
even if use_eager_fpu() == T. Now that interrupted_kernel_fpu_idle()
does not check __thread_has_fpu() in the use_eager_fpu() case, we
can remove the init_fpu() code from eager_fpu_init() and change
flush_thread() called by do_execve() to initialize FPU.

Note: of course, the change in flush_thread() is horrible and must be
cleaned up. We need the new helper, and flush_thread() should return the
error if init_fpu() fails.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/20150119185212.GD16427@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 15:50:45 +01:00
Oleg Nesterov
4b2e762e2e x86/fpu: Always allow FPU in interrupt if use_eager_fpu()
The __thread_has_fpu() check in interrupted_kernel_fpu_idle() was needed
to prevent a nested kernel_fpu_begin(). Now that we have in_kernel_fpu,
and the !__thread_has_fpu() case in __kernel_fpu_begin() does not depend
on use_eager_fpu() (except for clts), we can remove it.

__thread_has_fpu() can be false even if use_eager_fpu(), but this case
does not differ from the !use_eager_fpu() case, except that we should not
worry about X86_CR0_TS: __kernel_fpu_begin()/end() will not touch this bit.

Note: I think we can kill all irq_fpu_usable() checks except in_kernel_fpu;
we just need to record the state of X86_CR0_TS in __kernel_fpu_begin() and
conditionalize stts() in __kernel_fpu_end(), but this needs another patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150119185151.GC16427@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 15:50:41 +01:00
Oleg Nesterov
7aeccb83e7 x86/fpu: __kernel_fpu_begin() should clear fpu_owner_task even if use_eager_fpu()
__kernel_fpu_begin() does nothing if !__thread_has_fpu() && use_eager_fpu();
perhaps it assumes that this case is simply impossible. This is certainly
not possible if in_interrupt() == T: interrupted_user_mode() should have
FPU, and interrupted_kernel_fpu_idle() should fail if !__thread_has_fpu().

However, even if use_eager_fpu() == T, a task can do drop_fpu(), then switch
to another thread which becomes fpu_owner_task, then resume and call some
function which does kernel_fpu_begin(). Say, an exiting task does a lot of
things after exit_thread(); it is not safe to assume that it can't use FPU
in these paths.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Pekka Riikonen <priikone@iki.fi>
Link: http://lkml.kernel.org/r/20150119185132.GB16427@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 15:50:28 +01:00
Borislav Petkov
a930dc4543 x86/asm: Cleanup prefetch primitives
This is based on a patch originally by hpa.

With the current improvements to the alternatives, we can simply use %P1
as a mem8 operand constraint and rely on the toolchain to generate the
proper instruction sizes. For example, on 32-bit, where we use an empty
old instruction, we get:

  apply_alternatives: feat: 6*32+8, old: (c104648b, len: 4), repl: (c195566c, len: 4)
  c104648b: alt_insn: 90 90 90 90
  c195566c: rpl_insn: 0f 0d 4b 5c

  ...

  apply_alternatives: feat: 6*32+8, old: (c18e09b4, len: 3), repl: (c1955948, len: 3)
  c18e09b4: alt_insn: 90 90 90
  c1955948: rpl_insn: 0f 0d 08

  ...

  apply_alternatives: feat: 6*32+8, old: (c1190cf9, len: 7), repl: (c1955a79, len: 7)
  c1190cf9: alt_insn: 90 90 90 90 90 90 90
  c1955a79: rpl_insn: 0f 0d 0d a0 d4 85 c1

all with the proper padding done depending on the size of the
replacement instruction the compiler generates.
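
As an example of the resulting pattern, a prefetch helper can now be
written with an empty old instruction and a plain memory operand,
letting the toolchain pick the instruction length (sketch of the
32-bit case):

  static inline void prefetchw(const void *x)
  {
          alternative_input("",                   /* old: empty, NOP-padded */
                            "prefetchw %P1",      /* replacement */
                            X86_FEATURE_3DNOWPREFETCH,
                            "m" (*(const char *)x));
  }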

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
2015-02-23 13:44:17 +01:00
Borislav Petkov
8e65f6e03a x86/entry_32: Convert X86_INVD_BUG to ALTERNATIVE macro
Booting a 486 kernel on an AMD guest with this patch applied says:

  apply_alternatives: feat: 0*32+25, old: (c160a475, len: 5), repl: (c19557d4, len: 5)
  c160a475: alt_insn: 68 10 35 00 c1
  c19557d4: rpl_insn: 68 80 39 00 c1

which is:

  old insn VA: 0xc160a475, CPU feat: X86_FEATURE_XMM, size: 5
  simd_coprocessor_error:
           c160a475:      68 10 35 00 c1          push $0xc1003510 <do_general_protection>
  repl insn: 0xc19557d4, size: 5
           c160a475:      68 80 39 00 c1          push $0xc1003980 <do_simd_coprocessor_error>

Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 13:44:15 +01:00
Borislav Petkov
4fd4b6e553 x86/alternatives: Use optimized NOPs for padding
Alternatives now allow for an empty old instruction. In this case we
pad the space with NOPs at assembly time. However, the optimal, longer
NOPs should be used instead. Do that at patching time by adding
alt_instr.padlen-sized NOPs at the old instruction address.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 13:44:12 +01:00
Borislav Petkov
48c7a2509f x86/alternatives: Make JMPs more robust
Up until now we had to pay attention to how the relative offset of
relative JMPs in alternatives gets computed, so that the jump target
is still correct. Or, as is the case for near CALLs (opcode e8), we
still have to go and readjust the offset at patching time.

What is more, the static_cpu_has_safe() facility had to forcefully
generate 5-byte JMPs since we couldn't rely on the compiler to generate
properly sized ones, so we had to force the longest ones. Worse than
that, sometimes it would generate a replacement JMP which is longer than
the original one, thus overwriting the beginning of the next instruction
at patching time.

So, in order to alleviate all that and make using JMPs more
straightforward, we go and pad the original instruction in an
alternative block with NOPs at build time, should the replacement(s) be
longer. This way, alternatives users shouldn't have to pay special
attention to keeping original and replacement instruction sizes matched;
the assembler simply adds padding where needed and does nothing
otherwise.

As a second aspect, we go and recompute JMPs at patching time so that we
can try to make 5-byte JMPs into two-byte ones if possible. If not, we
still have to recompute the offsets as the replacement JMP gets put far
away in the .altinstr_replacement section leading to a wrong offset if
copied verbatim.

For example, on a locally generated kernel image

  old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
  repl insn: size: 5
  ffffffff81d0b23c:       e9 b1 62 2f ff          jmpq ffffffff810014f2

gets corrected to a 2-byte JMP:

  apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
  alt_insn: e9 b1 62 2f ff
  recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
  converted to: eb 33 90 90 90

and a 5-byte JMP:

  old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff81001516:      eb 30                   jmp ffffffff81001548
  repl insn: size: 5
   ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556

gets shortened into a two-byte one:

  apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
  alt_insn: e9 10 63 2f ff
  recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
  converted to: eb 3e 90 90 90

... and so on.

This leads to a net win of around

40ish replacements * 3 bytes savings =~ 120 bytes of I$

on an AMD guest which means some savings of precious instruction cache
bandwidth. The padding for the shorter 2-byte JMPs consists of single-byte
NOPs, which smart microarchitectures discard at decode time, thus freeing
up execution bandwidth.

Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 13:44:11 +01:00
Borislav Petkov
4332195c56 x86/alternatives: Add instruction padding
Up until now we have always paid attention to making sure that the length
of the new instruction replacing the old one is less than or equal to
the length of the old instruction. If the new instruction is longer, at
the time it replaces the old instruction it will overwrite the beginning
of the next instruction in the kernel image and cause your pants to
catch fire.

So instead of having to pay attention, teach the alternatives framework
to pad shorter old instructions with NOPs at buildtime - but only in the
case when

  len(old instruction(s)) < len(new instruction(s))

and add nothing in the >= case. (In that case we do add_nops() when
patching).

This way the alternatives user shouldn't have to care about instruction
sizes and simply use the macros.

Add asm ALTERNATIVE* flavor macros too, while at it.
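
A C-side user then simply writes something like (illustrative use of
the ALTERNATIVE macro; the called functions are placeholders and the
instruction lengths no longer need to match):

  /* The shorter/empty old instruction is padded with NOPs at build
   * time; patching overwrites it when the feature bit is set.
   */
  asm volatile(ALTERNATIVE("call default_impl",
                           "call optimized_impl",
                           X86_FEATURE_XMM2)
               : : : "memory");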

Also, we need to save the pad length in a separate struct alt_instr
member for NOP optimization, and the way to do that reliably is to carry
the pad length instead of trying to detect whether we're looking at
single-byte NOPs or at pathological instruction offsets like e9 90 90 90
90, for example, which is a valid instruction.

Thanks to Michael Matz for the great help with toolchain questions.

Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 13:44:00 +01:00
Borislav Petkov
db477a3386 x86/alternatives: Cleanup DPRINTK macro
Make it pass __func__ implicitly. Also, dump info about each replacement
we're doing. Fix up comments and style while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-23 13:35:50 +01:00
Yannick Guerrini
a927792c19 x86/cpu/intel: Fix trivial typo in intel_tlb_table[]
Change 'ssociative' to 'associative'

Signed-off-by: Yannick Guerrini <yguerrini@tomshardware.fr>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: Chris Bainbridge <chris.bainbridge@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Steven Honeyman <stevenhoneyman@gmail.com>
Cc: trivial@kernel.org
Link: http://lkml.kernel.org/r/1424558510-1420-1-git-send-email-yguerrini@tomshardware.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-22 08:55:58 +01:00
Linus Torvalds
10436cf881 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Two fixes: the paravirt spin_unlock() corruption/crash fix, and an
  rtmutex NULL dereference crash fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/spinlocks/paravirt: Fix memory corruption on unlock
  locking/rtmutex: Avoid a NULL pointer dereference on deadlock
2015-02-21 10:45:03 -08:00
Linus Torvalds
5fbe4c224c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
 "This contains:

   - EFI fixes
   - a boot printout fix
   - ASLR/kASLR fixes
   - intel microcode driver fixes
   - other misc fixes

  Most of the linecount comes from an EFI revert"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch
  x86/microcode/intel: Handle truncated microcode images more robustly
  x86/microcode/intel: Guard against stack overflow in the loader
  x86, mm/ASLR: Fix stack randomization on 64-bit systems
  x86/mm/init: Fix incorrect page size in init_memory_mapping() printks
  x86/mm/ASLR: Propagate base load address calculation
  Documentation/x86: Fix path in zero-page.txt
  x86/apic: Fix the devicetree build in certain configs
  Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes"
  x86/efi: Avoid triple faults during EFI mixed mode calls
2015-02-21 10:41:29 -08:00
Linus Torvalds
b5aeca54d0 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 uprobe/kprobe fixes from Ingo Molnar:
 "This contains two uprobes fixes, an uprobes comment update and a
  kprobes fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes/x86: Mark 2 bytes NOP as boostable
  uprobes/x86: Fix 2-byte opcode table
  uprobes/x86: Fix 1-byte opcode tables
  uprobes/x86: Add comment with insn opcodes, mnemonics and why we dont support them
2015-02-21 10:39:16 -08:00
Linus Torvalds
3f4d9925e9 Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rcu fix and x86 irq fix from Ingo Molnar:

 - Fix a bug that caused an RCU warning splat.

 - Two x86 irq related fixes: a hotplug crash fix and an ACPI IRQ
   registry fix.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Clear need_qs flag to prevent splat

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable()
  x86/irq: Fix regression caused by commit b568b8601f
2015-02-21 10:36:06 -08:00
Petr Mladek
2a6730c8b6 kprobes/x86: Check for invalid ftrace location in __recover_probed_insn()
__recover_probed_insn() should always be called from an address
where an instruction starts. The check for ftrace_location()
might help to discover a potential inconsistency.

This patch adds a WARN_ON() for when the inconsistency is detected.
It also adds handling of the situation when the original code
cannot be recovered.

Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Cc: Ananth NMavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1424441250-27146-3-git-send-email-pmladek@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-21 10:33:31 +01:00
Petr Mladek
650b7b23cb kprobes/x86: Use 5-byte NOP when the code might be modified by ftrace
can_probe() checks if the given address points to the beginning
of an instruction. It analyzes all the instructions from the
beginning of the function until the given address. The code
might be modified by another Kprobe. In this case, the current
code is read into a buffer, the int3 breakpoint is replaced by the
saved opcode in the buffer, and can_probe() analyzes the buffer
instead.

There is a bug that __recover_probed_insn() tries to restore
the original code even for Kprobes using the ftrace framework.
But in this case, the opcode is not stored. See the difference
between arch_prepare_kprobe() and arch_prepare_kprobe_ftrace().
The opcode is stored by arch_copy_kprobe() only from
arch_prepare_kprobe().

This patch makes Kprobes use the ideal 5-byte NOP when the
code can be modified by ftrace. It is the original instruction,
see ftrace_make_nop() and ftrace_nop_replace().

Note that we always need to use the NOP for ftrace locations.
Kprobes do not block ftrace and the instruction might get
modified at any time. It might even be in an inconsistent state
because it is modified step by step using the int3 breakpoint.

The patch also fixes indentation of the touched comment.

Note that I found this problem when playing with Kprobes. I did
it on x86_64 with gcc-4.8.3 that supported -mfentry. I modified
samples/kprobes/kprobe_example.c and added offset 5 to put
the probe right after the fentry area:

 static struct kprobe kp = {
 	.symbol_name	= "do_fork",
+	.offset = 5,
 };

Then I was able to load kprobe_example before jprobe_example
but not the other way around:

  $> modprobe jprobe_example
  $> modprobe kprobe_example
  modprobe: ERROR: could not insert 'kprobe_example': Invalid or incomplete multibyte or wide character

It did not make much sense and debugging pointed to the bug
described above.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth NMavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1424441250-27146-2-git-send-email-pmladek@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-21 10:33:30 +01:00
Jiri Kosina
570e1aa84c x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch
Commit f47233c2d3 ("x86/mm/ASLR: Propagate base load address
calculation") causes PAGE_SIZE redefinition warnings for UML
subarch builds. This is caused by added includes that were
leftovers from previous patch versions and are not actually
needed (especially the page_types.h include in module.c). Drop
those stray includes.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1502201017240.28769@pobox.suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-20 10:55:32 +01:00
Ingo Molnar
1fbe23e0de Merge tag 'microcode_fixes_for-3.21' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
Pull microcode fixes from Borislav Petkov:

  - Two fixes hardening microcode data handling. (Quentin Casasnovas)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 13:32:42 +01:00
Ingo Molnar
fa45a45ca3 Merge tag 'ras_for_3.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras
Pull RAS updates from Borislav Petkov:

 "- Enable AMD thresholding IRQ by default if supported. (Aravind Gopalakrishnan)

  - Unify mce_panic() message pattern. (Derek Che)

  - A bit more involved simplification of the CMCI logic after yet another
    report about race condition with the adaptive logic. (Borislav Petkov)

  - ACPI APEI EINJ fleshing out of the user documentation. (Borislav Petkov)

  - Minor cleanup. (Jan Beulich)"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 13:31:33 +01:00
Aravind Gopalakrishnan
d79f931f1c x86/MCE/AMD: Enable thresholding interrupts by default if supported
We set up APIC vectors for threshold errors if interrupt_capable.
However, we don't set interrupt_enable by default. Rework
threshold_restart_bank() so that when we set up lvt_offset, we also set
IntType to APIC and enable thresholding interrupts by default for banks
which support it.

User is still allowed to disable interrupts through sysfs.

While at it, check if status is valid before we proceed to log error
using mce_log. This is because, in multi-node platforms, only the NBC
(Node Base Core, i.e. the first core in the node) has valid status info
in its MCA registers. So, the decoding of status values on non-NBC
cores leads to noise in the kernel log like so:

  EDAC DEBUG: amd64_inject_write_store: section=0x80000000 word_bits=0x10020001
  [Hardware Error]: Corrected error, no action required.
  [Hardware Error]: CPU:25 (15:2:0) MC4_STATUS[-|CE|-|-|-
  [Hardware Error]: Corrected error, no action required.
  [Hardware Error]: CPU:26 (15:2:0) MC4_STATUS[-|CE|-|-|-
  <...>
  WARNING: CPU: 25 PID: 0 at drivers/edac/amd64_edac.c:2147 decode_bus_error+0x1ba/0x2a0()
  WARNING: CPU: 26 PID: 0 at drivers/edac/amd64_edac.c:2147 decode_bus_error+0x1ba/0x2a0()
  Something is rotten in the state of Denmark.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Link: http://lkml.kernel.org/r/1422896561-7695-1-git-send-email-aravind.gopalakrishnan@amd.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 13:24:47 +01:00
Derek Che
8af7043a3c x86/MCE: Make mce_panic() fatal machine check msg in the same pattern
There is another mce_panic() call with "Fatal machine check on current CPU" in
the same mce.c file, so keep them all in the same pattern:

	mce_panic("Fatal machine check on current CPU", &m, msg);

Signed-off-by: Derek Che <drc@yahoo-inc.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 13:24:47 +01:00
Borislav Petkov
3f2f0680d1 x86/MCE/intel: Cleanup CMCI storm logic
Initially, this started with yet another report about a race
condition in the CMCI storm adaptive period length thing. Yes, we have
to admit, it is fragile and error prone. So let's simplify it.

The simpler logic is: now, after we enter storm mode, we go straight to
polling with CMCI_STORM_INTERVAL, i.e. once a second. We remain in storm
mode as long as we see errors being logged while polling.

Theoretically, if we see an uninterrupted error stream, we will remain
in storm mode indefinitely and keep polling the MSRs.

However, when the storm is actually a burst of errors, once we have
logged them all, we back out of it after ~5 mins of polling and no more
errors logged.

If we encounter an error during those 5 minutes, we reset the polling
interval to 5 mins.

Making machine_check_poll() return a bool and denoting whether it has
seen an error or not lets us simplify a bunch of code and move the storm
handling private to mce_intel.c.

Some minor cleanups while at it.

Reported-by: Calvin Owens <calvinowens@fb.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1417746575-23299-1-git-send-email-calvinowens@fb.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 13:24:25 +01:00
Quentin Casasnovas
35a9ff4eec x86/microcode/intel: Handle truncated microcode images more robustly
We do not check the input data bounds containing the microcode before
copying a struct microcode_intel_header from it. A specially crafted
microcode could cause the kernel to read invalid memory and lead to a
denial-of-service.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1422964824-22056-3-git-send-email-quentin.casasnovas@oracle.com
[ Made error message differ from the next one and flipped comparison. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 12:42:23 +01:00
Quentin Casasnovas
f84598bd7c x86/microcode/intel: Guard against stack overflow in the loader
mc_saved_tmp is a fixed-size array allocated on the stack; we need to make
sure mc_saved_count stays within its bounds, otherwise we overflow
the stack in _save_mc(). A specially crafted microcode header could lead
to a kernel crash or potentially to arbitrary code execution in the kernel.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1422964824-22056-1-git-send-email-quentin.casasnovas@oracle.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 12:41:37 +01:00
Ingo Molnar
a267b0a349 Merge branch 'tip-x86-kaslr' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
Pull ASLR and kASLR fixes from Borislav Petkov:

  - Add a global flag announcing KASLR state so that relevant code can make
    informed decisions based on its setting. (Jiri Kosina)

  - Fix a stack randomization entropy decrease bug. (Hector Marco-Gisbert)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 12:31:34 +01:00
Jan Beulich
2cd4c303a7 x86/MCE/AMD: Drop bogus const modifier from AMD's bank4_names()
The compiler validly warns about it being ignored.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/54C21511020000780005890E@mail.emea.novell.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 12:30:47 +01:00
Jiri Kosina
f47233c2d3 x86/mm/ASLR: Propagate base load address calculation
Commit:

  e2b32e6785 ("x86, kaslr: randomize module base load address")

makes the base address for modules unconditionally randomized when
CONFIG_RANDOMIZE_BASE is defined and the "nokaslr" option isn't
present on the command line.

This is not consistent with how choose_kernel_location() decides whether
it will randomize kernel load base.

Namely, CONFIG_HIBERNATION disables kASLR (unless the "kaslr" option is
explicitly specified on the kernel command line), which makes the state space
larger than what the module loader is looking at. IOW, CONFIG_HIBERNATION &&
CONFIG_RANDOMIZE_BASE is a valid configuration in which kASLR wouldn't be
applied by default, but the module loader is not aware of that.

Instead of fixing the logic in module.c, this patch takes a more generic
approach. It introduces a new bootparam setup data type, SETUP_KASLR, and
uses it to pass the information whether kASLR has been applied during
kernel decompression, and sets a global 'kaslr_enabled' variable
accordingly, so that any kernel code (module loading, livepatching, ...)
can make decisions based on its value.

x86 module loader is converted to make use of this flag.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz
[ Always dump correct kaslr status when panicking ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:38:54 +01:00
Ingo Molnar
f353e61230 Merge branch 'tip-x86-fpu' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/fpu
Pull FPU updates from Borislav Petkov:

 "A round of updates to the FPU maze from Oleg and Rik. It should make
  the code a bit more understandable/readable/streamlined and a preparation
  for more cleanups and improvements in that area."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 11:19:05 +01:00
Rik van Riel
6a5fe8952b x86/fpu: Use task_disable_lazy_fpu_restore() helper
Replace magic assignments of fpu.last_cpu = ~0 with more explicit
task_disable_lazy_fpu_restore() calls.
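
The helper is essentially just a named wrapper around that magic value
(sketch):

  static inline void task_disable_lazy_fpu_restore(struct task_struct *tsk)
  {
          /* Pretend the FPU state was last loaded on an impossible CPU,
           * which disqualifies it from lazy restore.
           */
          tsk->thread.fpu.last_cpu = ~0;
  }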

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423252925-14451-8-git-send-email-riel@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:15:55 +01:00
Oleg Nesterov
08a744c6bf x86/fpu: Change math_error() to use unlazy_fpu(), kill (now) unused save_init_fpu()
math_error() calls save_init_fpu() after conditional_sti(), which means
that the caller can be preempted. If !use_eager_fpu() we can hit the
WARN_ON_ONCE(!__thread_has_fpu(tsk)) and/or save the wrong FPU state.

Change math_error() to use unlazy_fpu() and kill save_init_fpu().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423252925-14451-4-git-send-email-riel@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:15:03 +01:00
Oleg Nesterov
1a2a7f4ec8 x86/fpu: Don't do __thread_fpu_end() if use_eager_fpu()
unlazy_fpu()->__thread_fpu_end() doesn't look right if use_eager_fpu().
Unconditional __thread_fpu_end() is only correct if we know that this
thread can't return to user-mode and use FPU.

Fortunately it has only 2 callers. fpu_copy() checks use_eager_fpu(),
and init_fpu(current) can only be called by the coredumping thread via
regset->get(). But it is exported to modules, and imo this should be
fixed anyway.

And if we check use_eager_fpu() we can use __save_fpu() like fpu_copy()
and save_init_fpu() do.

- It seems that even !use_eager_fpu() case doesn't need the unconditional
  __thread_fpu_end(), we only need it if __save_init_fpu() returns 0.

- It is still not clear to me if __save_init_fpu() can safely nest with
  another save + restore from __kernel_fpu_begin(). If not, we can use
  kernel_fpu_disable() to fix the race.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423252925-14451-3-git-send-email-riel@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:12:46 +01:00
Oleg Nesterov
a9241ea5fd x86/fpu: Don't reset thread.fpu_counter
The "else" branch clears ->fpu_counter as a remnant of the lazy FPU
usage counting:

  e07e23e1fd ("[PATCH] non lazy "sleazy" fpu implementation")

However, switch_fpu_prepare() does this now, so the else branch is
superfluous.

If we do use_eager_fpu(), then this has no effect. Otherwise, if we
actually wanted to prevent fpu preload after the context switch we would
need to reset it unconditionally, even if __thread_has_fpu().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423252925-14451-2-git-send-email-riel@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:12:40 +01:00