Change uprobe_init_insn() to make insn_complete() == T, this makes
other insn_get_*() calls unnecessary.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
1. Extract the ->ia32_compat check from 64bit validate_insn_bits()
into the new helper, is_64bit_mm(), it will have more users.
TODO: this checks is actually wrong if mm owner is X32 task,
we need another fix which changes set_personality_ia32().
TODO: even worse, the whole 64-or-32-bit logic is very broken
and the fix is not simple, we need the nontrivial changes in
the core uprobes code.
2. Kill validate_insn_bits() and change its single caller to use
uprobe_init_insn(is_64bit_mm(mm).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
validate_insn_32bits() and validate_insn_64bits() are very similar,
turn them into the single uprobe_init_insn() which has the additional
"bool x86_64" argument which can be passed to insn_init() and used to
choose between good_insns_64/good_insns_32.
Also kill UPROBE_FIX_NONE, it has no users.
Note: the current code doesn't use ifdef's consistently, good_insns_64
depends on CONFIG_X86_64 but good_insns_32 is unconditional. This patch
removes ifdef around good_insns_64, we will add it back later along with
the similar one for good_insns_32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
All branch insns on x86 can be prefixed with the operand-size
override prefix, 0x66. It was only ever useful for performing
jumps to 32-bit offsets in 16-bit code segments.
In 32-bit code, such instructions are useless since
they cause IP truncation to 16 bits, and in case of call insns,
they save only 16 bits of return address and misalign
the stack pointer as a "bonus".
In 64-bit code, such instructions are treated differently by Intel
and AMD CPUs: Intel ignores the prefix altogether,
AMD treats them the same as in 32-bit mode.
Before this patch, the emulation code would execute
the instructions as if they have no 0x66 prefix.
With this patch, we refuse to attach uprobes to such insns.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Remove the direct accesses to FDT header data using accessor
function instead. This makes the code more readable and makes the FDT
blob structure more opaque to the arch code. This also prepares for
removing struct boot_param_header completely.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Tested-by: Grant Likely <grant.likely@linaro.org>
On x86 the allocation of irq descriptors may allocate interrupts which
are in the range of the GSI interrupts. That's wrong as those
interrupts are hardwired and we don't have the irq domain translation
like PPC. So one of these interrupts can be hooked up later to one of
the devices which are hard wired to it and the io_apic init code for
that particular interrupt line happily reuses that descriptor with a
completely different configuration so hell breaks lose.
Inside x86 we allocate dynamic interrupts from above nr_gsi_irqs,
except for a few usage sites which have not yet blown up in our face
for whatever reason. But for drivers which need an irq range, like the
GPIO drivers, we have no limit in place and we don't want to expose
such a detail to a driver.
To cure this introduce a function which an architecture can implement
to impose a lower bound on the dynamic interrupt allocations.
Implement it for x86 and set the lower bound to nr_gsi_irqs, which is
the end of the hardwired interrupt space, so all dynamic allocations
happen above.
That not only allows the GPIO driver to work sanely, it also protects
the bogus callsites of create_irq_nr() in hpet, uv, irq_remapping and
htirq code. They need to be cleaned up as well, but that's a separate
issue.
Reported-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Krogerus Heikki <heikki.krogerus@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1404241617360.28206@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Correct IRQ routing in case a vSMP box is detected
but the Interrupt Routing Comply (IRC) value is set to
"comply", which leads to incorrect IRQ routing.
Before the patch:
When a vSMP box was detected and IRC was set to "comply",
users (and the kernel) couldn't effectively set the
destination of the IRQs. This is because the hook inside
vsmp_64.c always setup all CPUs as the IRQ destination using
cpumask_setall() as the return value for IRQ allocation mask.
Later, this "overrided" mask caused the kernel to set the IRQ
destination to the lowest online CPU in the mask (CPU0 usually).
After the patch:
When the IRC is set to "comply", users (and the kernel) can control
the destination of the IRQs as we will not be changing the
default "apic->vector_allocation_domain".
Signed-off-by: Oren Twaig <oren@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Link: http://lkml.kernel.org/r/1398669697-2123-1-git-send-email-oren@scalemp.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used in preparation phase. Those are safely probed because
those are not invoked from breakpoint/fault/debug handlers,
there is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
can_boost
can_probe
can_optimize
is_IF_modifier
__copy_instruction
copy_optimized_instructions
arch_copy_kprobe
arch_prepare_kprobe
arch_arm_kprobe
arch_disarm_kprobe
arch_remove_kprobe
arch_trampoline_kprobe
arch_prepare_kprobe_ftrace
arch_prepare_optimized_kprobe
arch_check_optimized_kprobe
arch_within_optimized_kprobe
__arch_remove_optimized_kprobe
arch_remove_optimized_kprobe
arch_optimize_kprobes
arch_unoptimize_kprobe
I tested those functions by putting kprobes on all
instructions in the functions with the bash script
I sent to LKML. See:
https://lkml.org/lkml/2014/3/27/33
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Link: http://lkml.kernel.org/r/20140417081747.26341.36065.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To avoid a kernel crash by probing on lockdep code, call
kprobe_int3_handler() and kprobe_debug_handler()(which was
formerly called post_kprobe_handler()) directly from
do_int3 and do_debug.
Currently kprobes uses notify_die() to hook the int3/debug
exceptoins. Since there is a locking code in notify_die,
the lockdep code can be invoked. And because the lockdep
involves printk() related things, theoretically, we need to
prohibit probing on such code, which means much longer blacklist
we'll have. Instead, hooking the int3/debug for kprobes before
notify_die() can avoid this problem.
Anyway, most of the int3 handlers in the kernel are already
called from do_int3 directly, e.g. ftrace_int3_handler,
poke_int3_handler, kgdb_ll_trap. Actually only
kprobe_exceptions_notify is on the notifier_call_chain.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081733.26341.24423.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the NMI handlers(e.g. perf) can interrupt in the
single stepping (or preparing the single stepping, do_debug
etc.), we should consider a kprobe is hit in the NMI
handler. Even in that case, the kprobe is allowed to be
reentered as same as the kprobes hit in kprobe handlers
(KPROBE_HIT_ACTIVE or KPROBE_HIT_SSDONE).
The real issue will happen when a kprobe hit while another
reentered kprobe is processing (KPROBE_REENTER), because
we already consumed a saved-area for the previous kprobe.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Link: http://lkml.kernel.org/r/20140417081651.26341.10593.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fix from Ingo Molnar:
"This fixes the preemption-count imbalance crash reported by Owen
Kibel"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix CMCI preemption bugs
Change branch_setup_xol_ops() to simply use opc1 = OPCODE2(insn) - 0x10
if OPCODE1() == 0x0f; this matches the "short" jmp which checks the same
condition.
Thanks to lib/insn.c, it does the rest correctly. branch->ilen/offs are
correct no matter if this jmp is "near" or "short".
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Teach branch_emulate_op() to emulate the conditional "short" jmp's which
check regs->flags.
Note: this doesn't support jcxz/jcexz, loope/loopz, and loopne/loopnz.
They all are rel8 and thus they can't trigger the problem, but perhaps
we will add the support in future just for completeness.
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
See the previous "Emulate unconditional relative jmp's" which explains
why we can not execute "jmp" out-of-line, the same applies to "call".
Emulating of rip-relative call is trivial, we only need to additionally
push the ret-address. If this fails, we execute this instruction out of
line and this should trigger the trap, the probed application should die
or the same insn will be restarted if a signal handler expands the stack.
We do not even need ->post_xol() for this case.
But there is a corner (and almost theoretical) case: another thread can
expand the stack right before we execute this insn out of line. In this
case it hit the same problem we are trying to solve. So we simply turn
the probed insn into "call 1f; 1:" and add ->post_xol() which restores
->sp and restarts.
Many thanks to Jonathan who finally found the standalone reproducer,
otherwise I would never resolve the "random SIGSEGV's under systemtap"
bug-report. Now that the problem is clear we can write the simplified
test-case:
void probe_func(void), callee(void);
int failed = 1;
asm (
".text\n"
".align 4096\n"
".globl probe_func\n"
"probe_func:\n"
"call callee\n"
"ret"
);
/*
* This assumes that:
*
* - &probe_func = 0x401000 + a_bit, aligned = 0x402000
*
* - xol_vma->vm_start = TASK_SIZE_MAX - PAGE_SIZE = 0x7fffffffe000
* as xol_add_vma() asks; the 1st slot = 0x7fffffffe080
*
* so we can target the non-canonical address from xol_vma using
* the simple math below, 100 * 4096 is just the random offset
*/
asm (".org . + 0x800000000000 - 0x7fffffffe080 - 5 - 1 + 100 * 4096\n");
void callee(void)
{
failed = 0;
}
int main(void)
{
probe_func();
return failed;
}
It SIGSEGV's if you probe "probe_func" (although this is not very reliable,
randomize_va_space/etc can change the placement of xol area).
Note: as Denys Vlasenko pointed out, amd and intel treat "callw" (0x66 0xe8)
differently. This patch relies on lib/insn.c and thus implements the intel's
behaviour: 0x66 is simply ignored. Fortunately nothing sane should ever use
this insn, so we postpone the fix until we decide what should we do; emulate
or not, support or not, etc.
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Finally we can kill the ugly (and very limited) code in __skip_sstep().
Just change branch_setup_xol_ops() to treat "nop" as jmp to the next insn.
Thanks to lib/insn.c, it is clever enough. OPCODE1() == 0x90 includes
"(rep;)+ nop;" at least, and (afaics) much more.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Currently we always execute all insns out-of-line, including relative
jmp's and call's. This assumes that even if regs->ip points to nowhere
after the single-step, default_post_xol_op(UPROBE_FIX_IP) logic will
update it correctly.
However, this doesn't work if this regs->ip == xol_vaddr + insn_offset
is not canonical. In this case CPU generates #GP and general_protection()
kills the task which tries to execute this insn out-of-line.
Now that we have uprobe_xol_ops we can teach uprobes to emulate these
insns and solve the problem. This patch adds branch_xol_ops which has
a single branch_emulate_op() hook, so far it can only handle rel8/32
relative jmp's.
TODO: move ->fixup into the union along with rip_rela_target_address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
1. Add the trivial sizeof_long() helper and change other callers of
is_ia32_task() to use it.
TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
not necessarily mean 32bit. Fortunately syscall-like insns can't be
probed so it actually works, but it would be better to rename and
use is_ia32_frame().
2. As Jim pointed out "ncopied" in arch_uretprobe_hijack_return_addr()
and adjust_ret_addr() should be named "nleft". And in fact only the
last copy_to_user() in arch_uretprobe_hijack_return_addr() actually
needs to inspect the non-zero error code.
TODO: adjust_ret_addr() should die. We can always calculate the value
we need to write into *regs->sp, just UPROBE_FIX_CALL should record
insn->length.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
SIGILL after the failed arch_uprobe_post_xol() should only be used as
a last resort, we should try to restart the probed insn if possible.
Currently only adjust_ret_addr() can fail, and this can only happen if
another thread unmapped our stack after we executed "call" out-of-line.
Most probably the application if buggy, but even in this case it can
have a handler for SIGSEGV/etc. And in theory it can be even correct
and do something non-trivial with its memory.
Of course we can't restart unconditionally, so arch_uprobe_post_xol()
does this only if ->post_xol() returns -ERESTART even if currently this
is the only possible error.
default_post_xol_op(UPROBE_FIX_CALL) can always restart, but as Jim
pointed out it should not forget to pop off the return address pushed
by this insn executed out-of-line.
Note: this is not "perfect", we do not want the extra handler_chain()
after restart, but I think this is the best solution we can realistically
do without too much uglifications.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Currently the error from arch_uprobe_post_xol() is silently ignored.
This doesn't look good and this can lead to the hard-to-debug problems.
1. Change handle_singlestep() to loudly complain and send SIGILL.
Note: this only affects x86, ppc/arm can't fail.
2. Change arch_uprobe_post_xol() to call arch_uprobe_abort_xol() and
avoid TF games if it is going to return an error.
This can help to to analyze the problem, if nothing else we should
not report ->ip = xol_slot in the core-file.
Note: this means that handle_riprel_post_xol() can be called twice,
but this is fine because it is idempotent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
arch_uprobe_analyze_insn() calls handle_riprel_insn() at the start,
but only "0xff" and "default" cases need the UPROBE_FIX_RIP_ logic.
Move the callsite into "default" case and change the "0xff" case to
fall-through.
We are going to add the various hooks to handle the rip-relative
jmp/call instructions (and more), we need this change to enforce the
fact that the new code can not conflict with is_riprel_insn() logic
which, after this change, can only be used by default_xol_ops.
Note: arch_uprobe_abort_xol() still calls handle_riprel_post_xol()
directly. This is fine unless another _xol_ops we may add later will
need to reuse "UPROBE_FIX_RIP_AX|UPROBE_FIX_RIP_CX" bits in ->fixup.
In this case we can add uprobe_xol_ops->abort() hook, which (perhaps)
we will need anyway in the long term.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Introduce arch_uprobe->ops pointing to the "struct uprobe_xol_ops",
move the current UPROBE_FIX_{RIP*,IP,CALL} code into the default
set of methods and change arch_uprobe_pre/post_xol() accordingly.
This way we can add the new uprobe_xol_ops's to handle the insns
which need the special processing (rip-relative jmp/call at least).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
No functional changes. Preparation to simplify the review of the next
change. Just reorder the code in arch_uprobe_pre/post_xol() functions
so that UPROBE_FIX_{RIP_*,IP,CALL} logic goes to the end.
Also change arch_uprobe_pre_xol() to use utask instead of autask, to
make the code more symmetrical with arch_uprobe_post_xol().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cosmetic. Move pre_xol_rip_insn() and handle_riprel_post_xol() up to
the closely related handle_riprel_insn(). This way it is simpler to
read and understand this code, and this lessens the number of ifdef's.
While at it, update the comment in handle_riprel_post_xol() as Jim
suggested.
TODO: rename them somehow to make the naming consistent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Kill the "mm->context.ia32_compat" check in handle_riprel_insn(), if
it is true insn_rip_relative() must return false. validate_insn_bits()
passed "ia32_compat" as !x86_64 to insn_init(), and insn_rip_relative()
checks insn->x86_64.
Also, remove the no longer needed "struct mm_struct *mm" argument and
the unnecessary "return" at the end.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
No functional changes, preparation.
Shift the code from prepare_fixups() to arch_uprobe_analyze_insn()
with the following modifications:
- Do not call insn_get_opcode() again, it was already called
by validate_insn_bits().
- Move "case 0xea" up. This way "case 0xff" can fall through
to default case.
- change "case 0xff" to use the nested "switch (MODRM_REG)",
this way the code looks a bit simpler.
- Make the comments look consistent.
While at it, kill the initialization of rip_rela_target_address and
->fixups, we can rely on kzalloc(). We will add the new members into
arch_uprobe, it would be better to assume that everything is zero by
default.
TODO: cleanup/fix the mess in validate_insn_bits() paths:
- validate_insn_64bits() and validate_insn_32bits() should be
unified.
- "ifdef" is not used consistently; if good_insns_64 depends
on CONFIG_X86_64, then probably good_insns_32 should depend
on CONFIG_X86_32/EMULATION
- the usage of mm->context.ia32_compat looks wrong if the task
is TIF_X32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Current kprobes in-kernel page fault handler doesn't
expect that its single-stepping can be interrupted by
an NMI handler which may cause a page fault(e.g. perf
with callback tracing).
In that case, the page-fault handled by kprobes and it
misunderstands the page-fault has been caused by the
single-stepping code and tries to recover IP address
to probed address.
But the truth is the page-fault has been caused by the
NMI handler, and do_page_fault failes to handle real
page fault because the IP address is modified and
causes Kernel BUGs like below.
----
[ 2264.726905] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 2264.727190] IP: [<ffffffff813c46e0>] copy_user_generic_string+0x0/0x40
To handle this correctly, I fixed the kprobes fault
handler to ensure the faulted ip address is its own
single-step buffer instead of checking current kprobe
state.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Sandeepa Prabhu <sandeepa.prabhu@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fche@redhat.com
Cc: systemtap@sourceware.org
Link: http://lkml.kernel.org/r/20140417081644.26341.52351.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
27f6c573e0 ("x86, CMCI: Add proper detection of end of CMCI storms")
Added two preemption bugs:
- machine_check_poll() does a get_cpu_var() without a matching
put_cpu_var(), which causes preemption imbalance and crashes upon
bootup.
- it does percpu ops without disabling preemption. Preemption is not
disabled due to the mistaken use of a raw spinlock.
To fix these bugs fix the imbalance and change
cmci_discover_lock to a regular spinlock.
Reported-by: Owen Kibel <qmewlo@gmail.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chen, Gong <gong.chen@linux.intel.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Todorov <atodorov@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/n/tip-jtjptvgigpfkpvtQxpEk1at2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
--
arch/x86/kernel/cpu/mcheck/mce.c | 4 +---
arch/x86/kernel/cpu/mcheck/mce_intel.c | 18 +++++++++---------
2 files changed, 10 insertions(+), 12 deletions(-)
Pull x86 fixes from Ingo Molnar:
"Various fixes:
- reboot regression fix
- build message spam fix
- GPU quirk fix
- 'make kvmconfig' fix
plus the wire-up of the renameat2() system call on i386"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Remove the PCI reboot method from the default chain
x86/build: Supress "Nothing to be done for ..." messages
x86/gpu: Fix sign extension issue in Intel graphics stolen memory quirks
x86/platform: Fix "make O=dir kvmconfig"
i386: Wire up the renameat2() syscall
Several patches to fix cpu hotplug and the down'd cpu's irq
relocations have been submitted in the past month or so. The
patches should resolve the problems with cpu hotplug and irq
relocation, however, there is always a possibility that a bug
still exists. The big problem with debugging these irq
reassignments is that the cpu down completes and then we get
random stack traces from drivers for which irqs have not been
properly assigned to a new cpu. The stack traces are a mix of
storage, network, and other kernel subsystem (I once saw the
serial port stop working ...) warnings and failures.
The problem with these failures is that they are difficult to
diagnose. There is no warning in the cpu hotplug down path to
indicate that an IRQ has failed to be assigned to a new cpu, and
all we are left with is a stack trace from a driver, or a
non-functional device. If we had some information on the
console debugging these situations would be much easier; after
all we can map an IRQ to a device by simply using lspci or
/proc/interrupts.
The current code, fixup_irqs(), which migrates IRQs from the
down'd cpu and is called close to the end of the cpu down path,
calls chip->set_irq_affinity which eventually calls
__assign_irq_vector(). Errors are not propogated back from this
function call and this results in silent irq relocation
failures.
This patch fixes this issue by returning the error codes up the
call stack and prints out a warning if there is a relocation
failure.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rui Wang <rui.y.wang@intel.com>
Cc: Liu Ping Fan <kernelfans@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Li Fei <fei.li@intel.com>
Cc: gong.chen@linux.intel.com
Link: http://lkml.kernel.org/r/1396440673-18286-1-git-send-email-prarit@redhat.com
[ Made small cleanliness tweaks. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve reported a reboot hang and bisected it back to this commit:
a4f1987e4c x86, reboot: Add EFI and CF9 reboot methods into the default list
He heroically tested all reboot methods and found the following:
reboot=t # triple fault ok
reboot=k # keyboard ctrl FAIL
reboot=b # BIOS ok
reboot=a # ACPI FAIL
reboot=e # EFI FAIL [system has no EFI]
reboot=p # PCI 0xcf9 FAIL
And I think it's pretty obvious that we should only try PCI 0xcf9 as a
last resort - if at all.
The other observation is that (on this box) we should never try
the PCI reboot method, but close with either the 'triple fault'
or the 'BIOS' (terminal!) reboot methods.
Thirdly, CF9_COND is a total misnomer - it should be something like
CF9_SAFE or CF9_CAREFUL, and 'CF9' should be 'CF9_FORCE' ...
So this patch fixes the worst problems:
- it orders the actual reboot logic to follow the reboot ordering
pattern - it was in a pretty random order before for no good
reason.
- it fixes the CF9 misnomers and uses BOOT_CF9_FORCE and
BOOT_CF9_SAFE flags to make the code more obvious.
- it tries the BIOS reboot method before the PCI reboot method.
(Since 'BIOS' is a terminal reboot method resulting in a hang
if it does not work, this is essentially equivalent to removing
the PCI reboot method from the default reboot chain.)
- just for the miraculous possibility of terminal (resulting
in hang) reboot methods of triple fault or BIOS returning
without having done their job, there's an ordering between
them as well.
Reported-and-bisected-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Aubrey <aubrey.li@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Link: http://lkml.kernel.org/r/20140404064120.GB11877@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The legacy PIC may or may not be available and we need a mechanism to
detect the existence of the legacy PIC that is applicable for all
hardware (both physical as well as virtual) currently supported by
Linux.
On Hyper-V, when our legacy firmware presented to the guests, emulates
the legacy PIC while when our EFI based firmware is presented we do
not emulate the PIC. To support Hyper-V EFI firmware, we had to set
the legacy_pic to the null_legacy_pic since we had to bypass PIC based
calibration in the early boot code. While, on the EFI firmware, we
know we don't emulate the legacy PIC, we need a generic mechanism to
detect the presence of the legacy PIC that is not based on boot time
state - this became apparent when we tried to get kexec to work on
Hyper-V EFI firmware.
This patch implements the proposal put forth by H. Peter Anvin
<hpa@linux.intel.com>: Write a known value to the PIC data port and
read it back. If the value read is the value written, we do have the
PIC, if not there is no PIC and we can safely set the legacy_pic to
null_legacy_pic. Since the read from an unconnected I/O port returns
0xff, we will use ~(1 << PIC_CASCADE_IR) (0xfb: mask all lines except
the cascade line) to probe for the existence of the PIC.
In version V1 of the patch, I had cleaned up the code based on comments from Peter.
In version V2 of the patch, I have addressed additional comments from Peter.
In version V3 of the patch, I have addressed Jan's comments (JBeulich@suse.com).
In version V4 of the patch, I have addressed additional comments from Peter.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Link: http://lkml.kernel.org/r/1397501029-29286-1-git-send-email-kys@microsoft.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Have the KB(),MB(),GB() macros produce unsigned longs to avoid
unintended sign extension issues with the gen2 memory size
detection.
What happens is first the uint8_t returned by
read_pci_config_byte() gets promoted to an int which gets
multiplied by another int from the MB() macro, and finally the
result gets sign extended to size_t.
Although this shouldn't be a problem in practice as all affected
gen2 platforms are 32bit AFAIK, so size_t will be 32 bits.
Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1397382303-17525-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more ACPI and power management fixes and updates from Rafael Wysocki:
"This is PM and ACPI material that has emerged over the last two weeks
and one fix for a CPU hotplug regression introduced by the recent CPU
hotplug notifiers registration series.
Included are intel_idle and turbostat updates from Len Brown (these
have been in linux-next for quite some time), a new cpufreq driver for
powernv (that might spend some more time in linux-next, but BenH was
asking me so nicely to push it for 3.15 that I couldn't resist), some
cpufreq fixes and cleanups (including fixes for some silly breakage in
a couple of cpufreq drivers introduced during the 3.14 cycle),
assorted ACPI cleanups, wakeup framework documentation fixes, a new
sysfs attribute for cpuidle and a new command line argument for power
domains diagnostics.
Specifics:
- Fix for a recently introduced CPU hotplug regression in ARM KVM
from Ming Lei.
- Fixes for breakage in the at32ap, loongson2_cpufreq, and unicore32
cpufreq drivers introduced during the 3.14 cycle (-stable material)
from Chen Gang and Viresh Kumar.
- New powernv cpufreq driver from Vaidyanathan Srinivasan, with bits
from Gautham R Shenoy and Srivatsa S Bhat.
- Exynos cpufreq driver fix preventing it from being included into
multiplatform builds that aren't supported by it from Sachin Kamat.
- cpufreq cleanups related to the usage of the driver_data field in
struct cpufreq_frequency_table from Viresh Kumar.
- cpufreq ppc driver cleanup from Sachin Kamat.
- Intel BayTrail support for intel_idle and ACPI idle from Len Brown.
- Intel CPU model 54 (Atom N2000 series) support for intel_idle from
Jan Kiszka.
- intel_idle fix for Intel Ivy Town residency targets from Len Brown.
- turbostat updates (Intel Broadwell support and output cleanups)
from Len Brown.
- New cpuidle sysfs attribute for exporting C-states' target
residency information to user space from Daniel Lezcano.
- New kernel command line argument to prevent power domains enabled
by the bootloader from being turned off even if they are not in use
(for diagnostics purposes) from Tushar Behera.
- Fixes for wakeup sysfs attributes documentation from Geert
Uytterhoeven.
- New ACPI video blacklist entry for ThinkPad Helix from Stephen
Chandler Paul.
- Assorted ACPI cleanups and a Kconfig help update from Jonghwan
Choi, Zhihui Zhang, Hanjun Guo"
* tag 'pm+acpi-3.15-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (28 commits)
ACPI: Update the ACPI spec information in Kconfig
arm, kvm: fix double lock on cpu_add_remove_lock
cpuidle: sysfs: Export target residency information
cpufreq: ppc: Remove duplicate inclusion of fsl_soc.h
cpufreq: create another field .flags in cpufreq_frequency_table
cpufreq: use kzalloc() to allocate memory for cpufreq_frequency_table
cpufreq: don't print value of .driver_data from core
cpufreq: ia64: don't set .driver_data to index
cpufreq: powernv: Select CPUFreq related Kconfig options for powernv
cpufreq: powernv: Use cpufreq_frequency_table.driver_data to store pstate ids
cpufreq: powernv: cpufreq driver for powernv platform
cpufreq: at32ap: don't declare local variable as static
cpufreq: loongson2_cpufreq: don't declare local variable as static
cpufreq: unicore32: fix typo issue for 'clk'
cpufreq: exynos: Disable on multiplatform build
PM / wakeup: Correct presence vs. emptiness of wakeup_* attributes
PM / domains: Add pd_ignore_unused to keep power domains enabled
ACPI / dock: Drop dock_device_ids[] table
ACPI / video: Favor native backlight interface for ThinkPad Helix
ACPI / thermal: Fix wrong variable usage in debug statement
...
Pullx86 core platform updates from Peter Anvin:
"This is the x86/platform branch with the objectionable IOSF patches
removed.
What is left is proper memory handling for Intel GPUs, and a change to
the Calgary IOMMU code which will be required to make kexec work
sanely on those platforms after some upcoming kexec changes"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, calgary: Use 8M TCE table size by default
x86/gpu: Print the Intel graphics stolen memory range
x86/gpu: Add Intel graphics stolen memory quirk for gen2 platforms
x86/gpu: Add vfunc for Intel graphics stolen memory base address
Pull x86 fixes from Peter Anvin:
"This is a collection of minor fixes for x86, plus the IRET information
leak fix (forbid the use of 16-bit segments in 64-bit mode)"
NOTE! We may have to relax the "forbid the use of 16-bit segments in
64-bit mode" part, since there may be people who still run and depend on
16-bit Windows binaries under Wine.
But I'm taking this in the current unconditional form for now to see who
(if anybody) screams bloody murder. Maybe nobody cares. And maybe
we'll have to update it with some kind of runtime enablement (like our
vm.mmap_min_addr tunable that people who run dosemu/qemu/wine already
need to tweak).
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
efi: Pass correct file handle to efi_file_{read,close}
x86/efi: Correct EFI boot stub use of code32_start
x86/efi: Fix boot failure with EFI stub
x86/platform/hyperv: Handle VMBUS driver being a module
x86/apic: Reinstate error IRQ Pentium erratum 3AP workaround
x86, CMCI: Add proper detection of end of CMCI storms