When unwinding a task, the end of the stack is always at the same offset
right below the saved pt_regs, regardless of which syscall was used to
enter the kernel. That convention allows the unwinder to verify that a
stack is sane.
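As an aside, that end-of-stack check can be sketched roughly as follows
(illustrative C, not the unwinder's exact code; task_pt_regs() is the
usual helper for locating the saved pt_regs at the top of a task stack,
and the exact frame layout below it is an assumption for the example):

#include <linux/sched.h>
#include <asm/processor.h>

/*
 * Illustrative only: on a sane task stack, the last frame pointer sits
 * at a fixed offset just below the saved pt_regs -- here assumed to be
 * two words (saved RBP + return address) below task_pt_regs().
 */
static bool stack_end_is_sane(struct task_struct *task, unsigned long *bp)
{
	unsigned long *end = (unsigned long *)task_pt_regs(task) - 2;

	return bp == end;
}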
However, newly forked tasks don't always follow that convention, as
reported by the following unwinder warning seen by Dave Jones:
WARNING: kernel stack frame pointer at ffffc90001443f30 in kworker/u8:8:30468 has bad value (null)
The warning was due to the following call chain:
(ftrace handler)
call_usermodehelper_exec_async+0x5/0x140
ret_from_fork+0x22/0x30
The problem is that ret_from_fork() doesn't create a stack frame before
calling other functions. Fix that by carefully using the frame pointer
macros.
In addition to conforming to the end of stack convention, this also
makes related stack traces more sensible by making it clear to the user
that ret_from_fork() was involved.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8854cdaab980e9700a81e9ebf0d4238e4bbb68ef.1483978430.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With frame pointers, when a task is interrupted, its stack is no longer
completely reliable because the function could have been interrupted
before it had a chance to save the previous frame pointer on the stack.
So the caller of the interrupted function could get skipped by a stack
trace.
This is problematic for live patching, which needs to know whether a
stack trace of a sleeping task can be relied upon. There's currently no
way to detect if a sleeping task was interrupted by a page fault
exception or preemption before it went to sleep.
Another issue is that when dumping the stack of an interrupted task, the
unwinder has no way of knowing where the saved pt_regs registers are, so
it can't print them.
This solves those issues by encoding the pt_regs pointer in the frame
pointer on entry from an interrupt or an exception.
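Roughly, the scheme looks like this (a simplified C sketch, not
necessarily the exact implementation): the entry code stores the
pt_regs address with its low bit set where a saved frame pointer would
normally go, and the unwinder tests that bit to tell an encoded pt_regs
pointer apart from an ordinary frame.

#include <asm/ptrace.h>

/* Entry side, conceptually:  rbp = (unsigned long)regs | 0x1;  */

/* Unwinder side: recover the pt_regs pointer, or NULL for a normal frame. */
static struct pt_regs *decode_frame_pointer(unsigned long *bp)
{
	unsigned long word = (unsigned long)bp;

	if (!(word & 0x1))
		return NULL;		/* ordinary saved-RBP frame */

	return (struct pt_regs *)(word & ~0x1UL);
}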
This patch also updates the unwinder to be able to decode it, because
otherwise the unwinder would be broken by this change.
Note that this causes a change in the behavior of the unwinder: each
instance of a pt_regs on the stack is now considered a "frame". So
callers of unwind_get_return_address() will now get an occasional
'regs->ip' address that would have previously been skipped over.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8b9f84a21e39d249049e0547b559ff8da0df0988.1476973742.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull kbuild updates from Michal Marek:
- EXPORT_SYMBOL for asm source by Al Viro.
This does bring a regression, because genksyms no longer generates
checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
working on a patch to fix this.
Plus, we are talking about functions like strcpy(), which rarely
change prototypes.
- Fixes for PPC fallout of the above by Stephen Rothwell and Nick
Piggin
- fixdep speedup by Alexey Dobriyan.
- preparatory work by Nick Piggin to allow architectures to build with
-ffunction-sections, -fdata-sections and --gc-sections
- CONFIG_THIN_ARCHIVES support by Stephen Rothwell
- fix for filenames with colons in the initramfs source by me.
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
initramfs: Escape colons in depfile
ppc: there is no clear_pages to export
powerpc/64: whitelist unresolved modversions CRCs
kbuild: -ffunction-sections fix for archs with conflicting sections
kbuild: add arch specific post-link Makefile
kbuild: allow archs to select link dead code/data elimination
kbuild: allow architectures to use thin archives instead of ld -r
kbuild: Regenerate genksyms lexer
kbuild: genksyms fix for typeof handling
fixdep: faster CONFIG_ search
ia64: move exports to definitions
sparc32: debride memcpy.S a bit
[sparc] unify 32bit and 64bit string.h
sparc: move exports to definitions
ppc: move exports to definitions
arm: move exports to definitions
s390: move exports to definitions
m68k: move exports to definitions
alpha: move exports to actual definitions
x86: move exports to actual definitions
...
Pull low-level x86 updates from Ingo Molnar:
"In this cycle this topic tree has become one of those 'super topics'
that accumulated a lot of changes:
- Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
x86 - preceded by an array of changes. v4.8 saw preparatory changes
in this area already - this is the rest of the work. Includes the
thread stack caching performance optimization. (Andy Lutomirski)
- switch_to() cleanups and all around enhancements. (Brian Gerst)
- A large number of dumpstack infrastructure enhancements and an
unwinder abstraction. The secret long term plan is safe(r) live
patching plus maybe another attempt at debuginfo based unwinding -
but all these current bits are standalone enhancements in a frame
pointer based debug environment as well. (Josh Poimboeuf)
- More __ro_after_init and const annotations. (Kees Cook)
- Enable KASLR for the vmemmap memory region. (Thomas Garnier)"
[ The virtually mapped stack changes are pretty fundamental, and not
x86-specific per se, even if they are only used on x86 right now. ]
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
x86/asm: Get rid of __read_cr4_safe()
thread_info: Use unsigned long for flags
x86/alternatives: Add stack frame dependency to alternative_call_2()
x86/dumpstack: Fix show_stack() task pointer regression
x86/dumpstack: Remove dump_trace() and related callbacks
x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
oprofile/x86: Convert x86_backtrace() to use the new unwinder
x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
perf/x86: Convert perf_callchain_kernel() to use the new unwinder
x86/unwind: Add new unwind interface and implementations
x86/dumpstack: Remove NULL task pointer convention
fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
lib/syscall: Pin the task stack in collect_syscall()
x86/process: Pin the target stack in get_wchan()
x86/dumpstack: Pin the target stack when dumping it
kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
sched/core: Add try_get_task_stack() and put_task_stack()
x86/entry/64: Fix a minor comment rebase error
iommu/amd: Don't put completion-wait semaphore on stack
...
This warning:
WARNING: CPU: 0 PID: 3331 at arch/x86/entry/common.c:45 enter_from_user_mode+0x32/0x50
CPU: 0 PID: 3331 Comm: ldt_gdt_64 Not tainted 4.8.0-rc7+ #13
Call Trace:
dump_stack+0x99/0xd0
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
enter_from_user_mode+0x32/0x50
error_entry+0x6d/0xc0
? general_protection+0x12/0x30
? native_load_gs_index+0xd/0x20
? do_set_thread_area+0x19c/0x1f0
SyS_set_thread_area+0x24/0x30
do_int80_syscall_32+0x7c/0x220
entry_INT80_compat+0x38/0x50
... can be reproduced by running the GS testcase of the ldt_gdt test unit in
the x86 selftests.
do_int80_syscall_32() will call enter_from_user_mode() to convert the
context tracking state from user state to kernel state. The
load_gs_index() call can fail with a user gsbase; if that happens,
gsbase is fixed up and execution proceeds.
However, enter_from_user_mode() will be called again on the fixup path,
even though the context tracking state is already the kernel state.
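For reference, enter_from_user_mode() at the time looked roughly like
this (simplified sketch), which is why a second call from the fixup
path triggers the warning: the first call has already moved context
tracking to the kernel state.

#include <linux/context_tracking.h>

__visible void enter_from_user_mode(void)
{
	CT_WARN_ON(ct_state() != CONTEXT_USER);	/* fires on the second call */
	user_exit();				/* user -> kernel transition */
}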
This patch fixes it by just fixing up gsbase and telling lockdep that
IRQs are off when load_gs_index() fails with a user gsbase.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1475197266-3440-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As EBS does not mean anything reasonable in the context in which it is
used, it seems to be a misspelling of EBX.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
I was fishing RIP (i.e. RCX) out of pt_regs->cx and RFLAGS (i.e.
R11) out of pt_regs->r11. While it usually worked (pt_regs
started out with CX == IP and R11 == FLAGS), it was very
fragile. In particular, it broke sys_iopl() because sys_iopl()
forgot to mark itself as using ptregs.
Undo that part of the syscall rework. There was no compelling
reason to do it this way. While I'm at it, load RCX and R11
before the other regs to be a little friendlier to the CPU, as
they will be the first of the reloaded registers to be used.
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1e423bff95 ("x86/entry/64: Migrate the 64-bit syscall slow path to C")
Link: http://lkml.kernel.org/r/a85f8360c397e48186a9bc3e565ad74307a7b011.1454261517.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
64-bit syscalls currently have an optimization in which they are
called with partial pt_regs. A small handful require full
pt_regs.
In the 32-bit and compat cases, I cleaned this up by forcing
full pt_regs for all syscalls. The performance hit doesn't
really matter as the affected system calls are fundamentally
heavy and this is the 32-bit compat case.
I want to clean up the 64-bit case as well, but I don't want to
hurt fast path performance. To do that, I want to force the
syscalls that use pt_regs onto the slow path. This will enable
us to make slow path syscalls be real ABI-compliant C functions.
Use the new syscall entry qualification machinery for this.
'stub_clone' is now 'stub_clone/ptregs'.
The next patch will eliminate the stubs, and we'll just have
'sys_clone/ptregs'.
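A hypothetical sketch of how such a qualifier can be consumed when the
syscall table is expanded (the names here are illustrative, not the
actual machinery):

/*
 * A table entry carries an optional qualifier; a two-level macro picks
 * either the plain C syscall or a ptregs-taking stub for it.
 */
#define SC_QUAL_(sym)		sys_##sym	/* default: ordinary C function */
#define SC_QUAL_ptregs(sym)	stub_##sym	/* needs full pt_regs */

#define SYSCALL_ENTRY(nr, sym, qual)	[nr] = SC_QUAL_##qual(sym),

/* SYSCALL_ENTRY(56, clone, ptregs)  ->  [56] = stub_clone,  */
/* SYSCALL_ENTRY(0,  read,  )        ->  [0]  = sys_read,    */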
As of this patch, two-phase entry tracing is no longer used. It
has served its purpose (namely a huge speedup on some workloads
prior to more general opportunistic SYSRET support), and once
the dust settles I'll send patches to back it out.
The implementation is heavily based on a patch from Brian Gerst:
http://lkml.kernel.org/g/1449666173-15366-1-git-send-email-brgerst@gmail.com
Originally-From: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b9beda88460bcefec6e7d792bd44eca9b760b0c4.1454022279.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Paolo pointed out that enter_from_user_mode could be called
while irqflags were traced as though IRQs were on.
In principle, this could confuse lockdep. It doesn't cause any
problems that I've seen in any configuration, but if I build
with CONFIG_DEBUG_LOCKDEP=y, enable a nohz_full CPU, and add
code like:
if (irqs_disabled()) {
spin_lock(&something);
spin_unlock(&something);
}
to the top of enter_from_user_mode, then lockdep will complain
without this fix. It seems that lockdep's irqflags sanity
checks are too weak to detect this bug without forcing the
issue.
This patch adds one byte to normal kernels, and it's IMO a bit
ugly. I haven't spotted a better way to do this yet, though.
The issue is that we can't do TRACE_IRQS_OFF until after SWAPGS
(if needed), but we're also supposed to do it before calling C
code.
An alternative approach would be to call trace_hardirqs_off in
enter_from_user_mode. That would be less code and would not
bloat normal kernels at all, but it would be harder to see how
the code worked.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/86237e362390dfa6fec12de4d75a238acb0ae787.1447361906.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
example, trimmed for readability):
ff 15 00 00 00 00 callq *0x0(%rip) # 2796 <nmi+0x6>
2792: R_X86_64_PC32 pv_irq_ops+0x2c
That's a call through a function pointer to a regular C function that
does nothing on native boots, but that function isn't protected
against kprobes, isn't marked notrace, and is certainly not
guaranteed to preserve any registers if the compiler is feeling
perverse. This is bad news for a CLBR_NONE operation.
Of course, if everything works correctly, once paravirt ops are
patched, it gets nopped out, but what if we hit this code before
paravirt ops are patched in? This can potentially cause breakage
that is very difficult to debug.
A more subtle failure is possible here, too: if _paravirt_nop uses
the stack at all (even just to push RBP), it will overwrite the "NMI
executing" variable if it's called in the NMI prologue.
The Xen case, perhaps surprisingly, is fine, because it's already
written in asm.
Fix all of the cases that default to paravirt_nop (including
adjust_exception_frame) with a big hammer: replace paravirt_nop with
an asm function that is just a ret instruction.
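A minimal sketch of that kind of replacement, assuming GCC toplevel
inline asm (the symbol name is made up for illustration; section
placement and annotations in the real patch may differ):

/*
 * A function whose body is nothing but "ret": it never pushes anything
 * onto the stack or clobbers registers, so it is safe to call from
 * fragile entry paths even before paravirt patching has run.
 */
asm (".pushsection .text, \"ax\"\n"
     ".global paravirt_asm_nop\n"
     "paravirt_asm_nop:\n"
     "\tret\n"
     ".type paravirt_asm_nop, @function\n"
     ".size paravirt_asm_nop, . - paravirt_asm_nop\n"
     ".popsection\n");

extern void paravirt_asm_nop(void);	/* hypothetical name */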
The Xen case may have other problems, so document them.
This is part of a fix for some random crashes that Sasha saw.
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It turns out to be rather tedious to test the NMI nesting code.
Make it easier: add a new CONFIG_DEBUG_ENTRY option that causes
the NMI handler to pre-emptively unmask NMIs.
With this option set, errors in the repeat_nmi logic or failures
to detect that we're in a nested NMI will result in quick panics
under perf (especially if multiple counters are running at high
frequency) instead of requiring an unusual workload that
generates page faults or breakpoints inside NMIs.
I called it CONFIG_DEBUG_ENTRY instead of CONFIG_DEBUG_NMI_ENTRY
because I want to add new non-NMI checks elsewhere in the entry
code in the future, and I'd rather not add too many new config
options or add this option and then immediately rename it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, "NMI executing" is one the first time an outermost
NMI hits repeat_nmi and zero thereafter. Change it to be zero
each time for consistency.
This is intended to help NMI handling fail harder if it's buggy.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have a tricky bug in the nested NMI code: if we see RSP
pointing to the NMI stack on NMI entry from kernel mode, we
assume that we are executing a nested NMI.
This isn't quite true. A malicious userspace program can point
RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
happen while RSP is still pointing at the NMI stack.
Fix it with a sneaky trick. Set DF in the region of code that
the RSP check is intended to detect. IRET will clear DF
atomically.
( Note: other than paravirt, there's little need for all this
complexity. We could check RIP instead of RSP. )
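A simplified C rendering of the resulting check (the real code is
assembly in the NMI entry path; the helper and its arguments below are
illustrative):

#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_DF	(1UL << 10)	/* direction flag */

/*
 * Simplified idea: RSP pointing into the NMI stack only counts as a
 * nested NMI if DF is also set in the saved flags.  The protected NMI
 * region sets DF deliberately, and IRET clears DF atomically on the
 * way out, so a stale DF cannot survive the outer NMI's return.
 */
static bool looks_like_nested_nmi(bool rsp_in_nmi_stack, uint64_t saved_flags)
{
	return rsp_in_nmi_stack && (saved_flags & X86_EFLAGS_DF);
}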
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Check the repeat_nmi .. end_repeat_nmi special case first. The
next patch will rework the RSP check and, as a side effect, the
RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
we'll need this ordering of the checks.
Note: this is more subtle than it appears. The check for
repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
instead of adjusting the "iret" frame to force a repeat. This
is necessary, because the code between repeat_nmi and
end_repeat_nmi sets "NMI executing" and then writes to the
"iret" frame itself. If a nested NMI comes in and modifies the
"iret" frame while repeat_nmi is also modifying it, we'll end up
with garbage. The old code got this right, as does the new
code, but the new code is a bit more explicit.
If we were to move the check right after the "NMI executing"
check, then we'd get it wrong and have random crashes.
( Because the "NMI executing" check would jump to the code that would
modify the "iret" frame without checking if the interrupted NMI was
currently modifying it. )
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Returning to userspace is tricky: IRET can fail, and ESPFIX can
rearrange the stack prior to IRET.
The NMI nesting fixup relies on a precise stack layout and
atomic IRET. Rather than trying to teach the NMI nesting fixup
to handle ESPFIX and failed IRET, punt: run NMIs that came from
user mode on the normal kernel stack.
This will make some nested NMIs visible to C code, but the C
code is okay with that.
As a side effect, this should speed up perf: it eliminates an
RDMSR when NMIs come from user mode.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>