Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures,
with perf WARN_ON()s triggering. He also provided traces of the failures.
This is, I think, the relevant bit:
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1
> pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800)
So we add the insn:p event (fd[23]).
At this point we should have:
n_events = 2, n_added = 1, n_txn = 1
> pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800)
> pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00)
We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]).
These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so
that's not visible.
  group_sched_in()
    pmu->start_txn() /* nop - BP pmu */
    event_sched_in()
      event->pmu->add()
So here we should end up with:
0: n_events = 3, n_added = 2, n_txn = 2
4: n_events = 4, n_added = 3, n_txn = 3
But seeing the below state on x86_pmu_enable(), they must have failed,
because the 0 and 4 events aren't there anymore.
Looking at group_sched_in(), since the BP is the leader, its
event_sched_in() must have succeeded, for otherwise we would not have
seen the sibling adds.
But since neither 0 nor 4 is in the below state, their event_sched_in()
must have failed; but I don't see why, the complete state: 0,0,1:p,4
fits perfectly fine on a core2.
However, since we try and schedule 4 it means the 0 event must have
succeeded! Therefore the 4 event must have failed, its failure will
have put group_sched_in() into the fail path, which will call:
  event_sched_out()
    event->pmu->del()
on 0 and the BP event.
Now x86_pmu_del() will reduce n_events; but it will not reduce n_added;
giving what we see below:
n_events = 2, n_added = 2, n_txn = 2
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0
So the problem is that x86_pmu_del(), when called from a
group_sched_in() that fails (for whatever reason), and without x86_pmu
TXN support (because the leader is !x86_pmu), will corrupt the n_added
state.
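The bookkeeping problem can be replayed with a toy model. The sketch below is not the kernel code: struct toy_cpuc and the toy_* helpers are hypothetical names, n_txn is left out, and the repaired del() shows one possible way to keep n_added consistent, which is an assumption rather than a statement of what was merged.

```c
#include <assert.h>

/* Toy model of the per-CPU bookkeeping; field names follow the trace output. */
struct toy_cpuc {
	int n_events;	/* events currently collected */
	int n_added;	/* events added since the last x86_pmu_enable() */
};

/* Models pmu->add(): both counters grow. */
static void toy_add(struct toy_cpuc *c)
{
	c->n_events++;
	c->n_added++;
}

/* Models the behaviour described above: del() only shrinks n_events. */
static void toy_del_buggy(struct toy_cpuc *c)
{
	assert(c->n_events > 0);
	c->n_events--;
}

/*
 * One possible repair (an assumption, not necessarily the merged fix):
 * if the removed slot is one of the not-yet-enabled additions, i.e.
 * idx >= n_events - n_added, shrink n_added as well.
 */
static void toy_del_fixed(struct toy_cpuc *c, int idx)
{
	assert(idx >= 0 && idx < c->n_events);
	if (idx >= c->n_events - c->n_added)
		c->n_added--;
	c->n_events--;
}

/*
 * Replays the sequence from the trace: one event is live, two are
 * added, then the last addition is rolled back by the fail path.
 */
static struct toy_cpuc toy_replay(int use_fix)
{
	struct toy_cpuc c = { .n_events = 1, .n_added = 0 };

	toy_add(&c);	/* insn:p -> n_events = 2, n_added = 1 */
	toy_add(&c);	/* cycles -> n_events = 3, n_added = 2 */
	if (use_fix)
		toy_del_fixed(&c, c.n_events - 1);
	else
		toy_del_buggy(&c);
	return c;
}
```

toy_replay(0) ends with n_events = 2 and n_added = 2, the inconsistent pair seen in the trace; toy_replay(1) ends with n_events = 2 and n_added = 1.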
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Randomize the load address of modules in the kernel to make kASLR
effective for modules. Modules can only be loaded within a particular
range of virtual address space. This patch adds 10 bits of entropy to
the load address by adding between 1 and 1024 pages (1-1024 * PAGE_SIZE)
to the beginning of the range where modules are loaded.
The single base offset was chosen because randomizing each module
load ends up wasting/fragmenting memory too much. Prior approaches to
minimizing fragmentation while doing randomization tend to result in
worse entropy than just doing a single base address offset.
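As a rough illustration of the arithmetic, the sketch below maps a random draw onto the 1..1024 page offset. The toy_* names, the 4K page size and the way the random value is reduced are assumptions made for the example; the kernel's own helper and random source are not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE    4096ULL                /* assumed 4K pages */
#define TOY_MODULES_BASE 0xffffffffc0000000ULL  /* start of the module range in the dumps */

/*
 * Map an arbitrary random value onto 1..1024 pages and return the
 * resulting byte offset. 1024 choices give the 10 bits of entropy.
 * The offset is meant to be picked once, not per module.
 */
static uint64_t toy_module_offset(uint32_t rnd)
{
	uint64_t pages = (rnd % 1024) + 1;

	assert(pages >= 1 && pages <= 1024);
	return pages * TOY_PAGE_SIZE;
}

/* First address at which a module may be placed for a given draw. */
static uint64_t toy_module_start(uint32_t rnd)
{
	return TOY_MODULES_BASE + toy_module_offset(rnd);
}
```

For instance, a draw of 958 selects 959 pages, which puts the first module at 0xffffffffc03bf000.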
Example kASLR boot without this change, with a single module loaded:
---[ Modules ]---
0xffffffffc0000000-0xffffffffc0001000 4K ro GLB x pte
0xffffffffc0001000-0xffffffffc0002000 4K ro GLB NX pte
0xffffffffc0002000-0xffffffffc0004000 8K RW GLB NX pte
0xffffffffc0004000-0xffffffffc0200000 2032K pte
0xffffffffc0200000-0xffffffffff000000 1006M pmd
---[ End Modules ]---
Example kASLR boot after this change, same module loaded:
---[ Modules ]---
0xffffffffc0000000-0xffffffffc0200000 2M pmd
0xffffffffc0200000-0xffffffffc03bf000 1788K pte
0xffffffffc03bf000-0xffffffffc03c0000 4K ro GLB x pte
0xffffffffc03c0000-0xffffffffc03c1000 4K ro GLB NX pte
0xffffffffc03c1000-0xffffffffc03c3000 8K RW GLB NX pte
0xffffffffc03c3000-0xffffffffc0400000 244K pte
0xffffffffc0400000-0xffffffffff000000 1004M pmd
---[ End Modules ]---
Signed-off-by: Andy Honig <ahonig@google.com>
Link: http://lkml.kernel.org/r/20140226005916.GA27083@www.outflux.net
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull x86 fixes from Thomas Gleixner:
- a bugfix which prevents a divide by 0 panic when the newly introduced
try_msr_calibrate_tsc() fails
- enablement of the Baytrail platform to utilize the newfangled msr
based calibration
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: tsc: Add missing Baytrail frequency to the table
x86, tsc: Fallback to normal calibration if fast MSR calibration fails
These days hv_clock allocation is memblock based (i.e. the percpu
allocator is not involved), which means that the physical address
of each of the per-cpu hv_clock areas is guaranteed to remain
unchanged through all its lifetime and we do not need to update
its location after CPU bring-up.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using BTS on Core i7-4*, I get the below kernel warning.
$ perf record -c 1 -e branches:u ls
Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
kernel:[ 438.317893] Uhhuh. NMI received for unknown reason 31 on CPU 2.
Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
kernel:[ 438.317920] Do you have a strange power saving mode enabled?
Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
kernel:[ 438.317945] Dazed and confused, but trying to continue
Make intel_pmu_handle_irq() take the full exit path when returning early.
Cc: eranian@google.com
Cc: peterz@infradead.org
Cc: mingo@kernel.org
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392425048-5309-1-git-send-email-andi@firstfloor.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch is needed because that PMU uses 32-bit free
running counters with no interrupt capabilities.
On SNB/IVB/HSW, we used 20GB/s theoretical peak to calculate
the hrtimer timeout necessary to avoid missing an overflow.
That delay is set to 5s to be on the cautious side.
The SNB IMC uses free running counters, which are handled
via pseudo fixed counters. The SNB IMC PMU implementation
supports an arbitrary number of events, because the counters
are read-only. Therefore it is not possible to track active
counters. Instead we put active events on a linked list which
is then used by the hrtimer handler to update the SW counts.
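The choice of delay can be sanity-checked with a back-of-the-envelope calculation. The sketch below is only that: toy_wrap_seconds() is a hypothetical helper, and both the 64 bytes per count (one cache line) and GB = 10^9 bytes are assumptions not stated above.

```c
#include <assert.h>

/*
 * Seconds until a free-running counter of `bits` width wraps, given
 * how many bytes one count represents and the peak bandwidth.
 */
static double toy_wrap_seconds(unsigned bits, double bytes_per_count,
			       double peak_bytes_per_sec)
{
	assert(bits > 0 && bits < 64);
	assert(bytes_per_count > 0.0 && peak_bytes_per_sec > 0.0);
	return (double)(1ULL << bits) * bytes_per_count / peak_bytes_per_sec;
}
```

Under those assumptions toy_wrap_seconds(32, 64.0, 20e9) comes out between 13 and 14 seconds, so a 5s timer would sample each counter at least twice per wrap.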
Cc: mingo@elte.hu
Cc: acme@redhat.com
Cc: ak@linux.intel.com
Cc: zheng.z.yan@intel.com
Cc: peterz@infradead.org
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392132015-14521-8-git-send-email-eranian@google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Current ACPI cpu hotplug driver fails to associate hot-added CPUs with
corresponding NUMA node when doing socket online. The code path to
associate CPU with NUMA node is as below:
  acpi_processor_add()
    ->acpi_processor_get_info()
      ->acpi_processor_hotadd_init()
        ->acpi_map_lsapic()
          ->_acpi_map_lsapic()
            ->acpi_map_cpu2node()

  cpu_subsys_online()
    ->try_online_node()
      ->node_set_online()
When doing socket online, a new NUMA node is introduced in addition to
hot-added CPU and memory device. And the new NUMA node is marked as
online when onlining hot-added CPUs through sysfs interface
/sys/devices/system/cpu/cpuxx/online.
On the other hand, acpi_map_cpu2node() will only build the CPU to node
map if corresponding NUMA node is already online, so it always fails
to associate hot-added CPUs with corresponding NUMA node because the
NUMA node is still in offline state.
For the fix, we could safely remove the "node_online(node)" check in
function acpi_map_cpu2node() because it's only called for hot-added CPUs
by acpi_processor_hotadd_init().
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1390185115-26850-1-git-send-email-jiang.liu@linux.intel.com
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull DMA-mapping fixes from Marek Szyprowski:
"This contains fixes for incorrect atomic test in dma-mapping subsystem
for ARM and x86 architecture"
* 'fixes-for-v3.14' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
x86: dma-mapping: fix GFP_ATOMIC macro usage
ARM: dma-mapping: fix GFP_ATOMIC macro usage
Linux uses CPUID.MWAIT.EDX to validate the C-states
reported by ACPI, silently discarding states which
are not supported by the HW.
This test is too restrictive, as some HW now uses
sparse sub-state numbering, so the sub-state number
may be higher than the number of sub-states...
Also, rather than silently ignoring an invalid state,
we should complain about a firmware bug.
In practice...
Bay Trail systems originally supported C6-no-shrink as
MWAIT sub-state 0x58, and in CPUID.MWAIT.EDX 0x03000000
indicated that there were 3 MWAIT-C6 sub-states.
So acpi_idle would discard that C-state because 8 >= 3.
Upon discovering this issue, the ucode was updated so that
C6-no-shrink was also exported as 0x51, and the BIOS was
updated to match. However, systems shipped with 0x58
will never get a BIOS update, and this patch allows
Linux to see C6-no-shrink on early Bay Trail.
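The Bay Trail numbers can be checked with a small model of the test. This is a sketch, not the acpi_idle code: the toy_* names are hypothetical, the old test follows the "8 >= 3" wording above, and the relaxed test (accept any sub-state number as long as the C-state advertises at least one sub-state) is an assumption about what "less restrictive" means here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SUBSTATE_BITS 4
#define TOY_SUBSTATE_MASK 0xfu
#define TOY_CSTATE_MASK   0xfu

/* Sub-state count that CPUID.MWAIT.EDX advertises for the hint's C-state. */
static unsigned toy_num_substates(uint32_t edx, uint32_t hint)
{
	unsigned cstate = ((hint >> TOY_SUBSTATE_BITS) & TOY_CSTATE_MASK) + 1;
	unsigned shift = cstate * TOY_SUBSTATE_BITS;

	if (shift >= 32)
		return 0;	/* EDX has no field for this C-state */
	return (edx >> shift) & TOY_SUBSTATE_MASK;
}

/* Old test: the sub-state number must be below the advertised count. */
static int toy_old_accepts(uint32_t edx, uint32_t hint)
{
	return (hint & TOY_SUBSTATE_MASK) < toy_num_substates(edx, hint);
}

/* Relaxed test: tolerate sparse numbering, require a non-zero count. */
static int toy_new_accepts(uint32_t edx, uint32_t hint)
{
	return toy_num_substates(edx, hint) != 0;
}
```

With EDX = 0x03000000, hint 0x58 decodes to sub-state 8 of a C-state with 3 advertised sub-states: toy_old_accepts() rejects it and toy_new_accepts() keeps it, while hint 0x51 passes both.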
Signed-off-by: Len Brown <len.brown@intel.com>
If we cannot calibrate TSC via MSR based calibration
try_msr_calibrate_tsc() stores zero to fast_calibrate and returns that
to the caller. This value then gets propagated further to clockevents
code, resulting in a division-by-zero oops like the one below:
divide error: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 3.13.0+ #47
task: ffff880075508000 ti: ffff880075506000 task.ti: ffff880075506000
RIP: 0010:[<ffffffff810aec14>] [<ffffffff810aec14>] clockevents_config.part.3+0x24/0xa0
RSP: 0000:ffff880075507e58 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffff880079c0cd80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffffffffff
RBP: ffff880075507e70 R08: 0000000000000001 R09: 00000000000000be
R10: 00000000000000bd R11: 0000000000000003 R12: 000000000000b008
R13: 0000000000000008 R14: 000000000000b010 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880079c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880079fff000 CR3: 0000000001c0b000 CR4: 00000000001006f0
Stack:
ffff880079c0cd80 000000000000b008 0000000000000008 ffff880075507e88
ffffffff810aecb0 ffff880079c0cd80 ffff880075507e98 ffffffff81030168
ffff880075507ed8 ffffffff81d1104f 00000000000000c3 0000000000000000
Call Trace:
[<ffffffff810aecb0>] clockevents_config_and_register+0x20/0x30
[<ffffffff81030168>] setup_APIC_timer+0xc8/0xd0
[<ffffffff81d1104f>] setup_boot_APIC_clock+0x4cc/0x4d8
[<ffffffff81d0f5de>] native_smp_prepare_cpus+0x3dd/0x3f0
[<ffffffff81d02ee9>] kernel_init_freeable+0xc3/0x205
[<ffffffff8177c910>] ? rest_init+0x90/0x90
[<ffffffff8177c91e>] kernel_init+0xe/0x120
[<ffffffff8178deec>] ret_from_fork+0x7c/0xb0
[<ffffffff8177c910>] ? rest_init+0x90/0x90
Prevent this from happening by:
1) Modifying try_msr_calibrate_tsc() to return calibration value or zero
if it fails.
2) Checking this return value in native_calibrate_tsc() and, in case of
zero, falling back to normal non-MSR based calibration.
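The two steps amount to the following control flow. This is a minimal sketch with hypothetical toy_* stand-ins for the two calibration paths and made-up kHz values; it only shows that a zero result never leaves the calibration function.

```c
#include <assert.h>

/* Signature of a calibration routine: returns kHz, or 0 on failure. */
typedef unsigned long (*toy_calibrate_fn)(void);

/*
 * Try the fast path first; a zero result means "could not calibrate"
 * and must never reach code that divides by the frequency.
 */
static unsigned long toy_calibrate_tsc(toy_calibrate_fn fast,
				       toy_calibrate_fn slow)
{
	unsigned long khz;

	assert(fast && slow);
	khz = fast();
	if (khz)
		return khz;
	return slow();
}

/* Hypothetical stand-ins for the two calibration paths. */
static unsigned long toy_fast_fails(void) { return 0; }
static unsigned long toy_fast_ok(void)    { return 1333000; }
static unsigned long toy_slow(void)       { return 1330000; }
```

toy_calibrate_tsc(toy_fast_fails, toy_slow) returns the slow-path value, and toy_calibrate_tsc(toy_fast_ok, toy_slow) keeps the fast-path value.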
[mw: Added subject and changelog]
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bin Gao <bin.gao@linux.intel.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1392810750-18660-1-git-send-email-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
BAD_MADT_ENTRY() is arch independent and will be used for all
architectures which parse MADT, so move it to linux/acpi.h to
reduce code duplication.
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The x86 CPU feature modalias handling existed before it was reimplemented
generically. This patch aligns the x86 handling so that it
(a) reuses some more code that is now generic;
(b) uses the generic format for the modalias module metadata entry, i.e., it
now uses 'cpu:type:x86,venVVVVfamFFFFmodMMMM:feature:,XXXX,YYYY' instead of
the 'x86cpu:vendor:VVVV:family:FFFF:model:MMMM:feature:,XXXX,YYYY' that was
used before.
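To make the two layouts concrete, the sketch below formats both strings. The toy_* helpers are hypothetical, and four upper-case hex digits per field is an assumption read off the VVVV/FFFF/MMMM placeholders; the feature list is passed through as an opaque string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Generic layout: cpu:type:x86,venVVVVfamFFFFmodMMMM:feature:... */
static int toy_modalias_new(char *buf, size_t len, unsigned ven,
			    unsigned fam, unsigned mod, const char *features)
{
	assert(buf && features);
	return snprintf(buf, len, "cpu:type:x86,ven%04Xfam%04Xmod%04X:feature:%s",
			ven, fam, mod, features);
}

/* Previous x86-only layout: x86cpu:vendor:VVVV:family:FFFF:model:MMMM:feature:... */
static int toy_modalias_old(char *buf, size_t len, unsigned ven,
			    unsigned fam, unsigned mod, const char *features)
{
	assert(buf && features);
	return snprintf(buf, len, "x86cpu:vendor:%04X:family:%04X:model:%04X:feature:%s",
			ven, fam, mod, features);
}
```

For vendor 0, family 6, model 0x3c and a feature list of ",0081," the new helper produces cpu:type:x86,ven0000fam0006mod003C:feature:,0081,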
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull two tracing fixes from Steven Rostedt:
"Two urgent fixes in the tracing utility.
The first is a fix for the way the ring buffer stores timestamps.
After a restructure of the code was done, the ring buffer timestamp
logic missed the fact that the first event on a sub buffer is to have
a zero delta, as the full timestamp is stored on the sub buffer
itself. But because the delta was not cleared to zero, the timestamp
for that event will be calculated as the real timestamp + the delta
from the last timestamp. This can skew the timestamps of the events
and have them say they happened when they didn't really happen.
That's bad.
The second fix is for modifying the function graph caller site. When
the stop machine was removed from updating the function tracing code,
it missed updating the function graph call site location. It is still
modified as if it is being done via stop machine. But it's not. This
can lead to a GPF and kernel crash if the function graph call site
happens to lie between cache lines and one CPU is executing it while
another CPU is doing the update. It would be a very hard condition to
hit, but the result is severe enough to have it fixed ASAP"
* tag 'trace-fixes-v3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/x86: Use breakpoints for converting function graph caller
ring-buffer: Fix first commit on sub-buffer having non-zero delta
There should no longer be any IBM x440 systems or those using the
Summit/EXA chipset out in the wild, so remove support for it.
We've done our due diligence in reaching out to any contact information
listed for this chipset and no indication was given that it should be
kept around.
Signed-off-by: David Rientjes <rientjes@google.com>
There should no longer be any ia32-based Unisys ES7000 systems out in
the wild, so remove support for it.
We've done our due diligence in reaching out to any contact information
listed for this system and no indication was given that it should be
kept around.
Signed-off-by: David Rientjes <rientjes@google.com>
When the conversion was made to remove stop machine and use the breakpoint
logic instead, the modification of the function graph caller is still
done directly as though it was being done under stop machine.
As it is not converted via stop machine anymore, there is a possibility
that the code could be laid across cache lines and if another CPU is
accessing that function graph call when it is being updated, it could
cause a General Protection Fault.
Convert the update of the function graph caller to use the breakpoint
method as well.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org # 3.5+
Fixes: 08d636b6d4 "ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
GFP_ATOMIC is not a single gfp flag, but a macro which expands to other
flags; what matters is the LACK of the __GFP_WAIT flag. To check if the caller
wants to perform an atomic allocation, the code must test for a lack of the
__GFP_WAIT flag. This patch fixes the issue introduced in v3.5-rc1.
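The difference between the two tests shows up as soon as a sleeping allocation carries one of the bits GFP_ATOMIC expands to. The sketch below uses hypothetical TOY_* flags; the bit values and the expansion of the two composite masks are illustrative assumptions, only the relationships between them matter.

```c
#include <assert.h>

/* Illustrative bit values; only the relationships between them matter. */
#define TOY_GFP_WAIT   0x10u
#define TOY_GFP_HIGH   0x20u
#define TOY_GFP_IO     0x40u
#define TOY_GFP_FS     0x80u
#define TOY_GFP_ATOMIC (TOY_GFP_HIGH)
#define TOY_GFP_KERNEL (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)

/* Broken test: any mask sharing a bit with GFP_ATOMIC looks atomic. */
static int toy_is_atomic_buggy(unsigned gfp)
{
	return (gfp & TOY_GFP_ATOMIC) != 0;
}

/* Test described above: atomic means the caller did not allow waiting. */
static int toy_is_atomic(unsigned gfp)
{
	return !(gfp & TOY_GFP_WAIT);
}
```

TOY_GFP_KERNEL | TOY_GFP_HIGH may sleep, yet the buggy test calls it atomic; a mask of 0 may not sleep, yet the buggy test calls it non-atomic. toy_is_atomic() gets both right.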
CC: stable@vger.kernel.org
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
There isn't an explicit stolen memory base register on gen2.
Some old comment in the i915 code suggests we should get it via
max_low_pfn_mapped, but that's clearly a bad idea on my MGM.
The e820 map in said machine looks like this:
BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable
BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000ce000-0x00000000000cffff] reserved
BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000001f6effff] usable
BIOS-e820: [mem 0x000000001f6f0000-0x000000001f6f7fff] ACPI data
BIOS-e820: [mem 0x000000001f6f8000-0x000000001f6fffff] ACPI NVS
BIOS-e820: [mem 0x000000001f700000-0x000000001fffffff] reserved
BIOS-e820: [mem 0x00000000fec10000-0x00000000fec1ffff] reserved
BIOS-e820: [mem 0x00000000ffb00000-0x00000000ffbfffff] reserved
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
That makes max_low_pfn_mapped = 1f6f0000, so assuming our stolen
memory would start there would place it on top of some ACPI
memory regions. So not a good idea as already stated.
The 9MB region after the ACPI regions at 0x1f700000 however
looks promising given that the machine reports the stolen memory
size to be 8MB. Looking at the PGTBL_CTL register, the GTT
entries are at offset 0x1fee0000, and given that the GTT
entries occupy 128KB, it looks like the stolen memory could
start at 0x1f700000 and the GTT entries would occupy the last
128KB of the stolen memory.
After some more digging through chipset documentation, I've
determined the BIOS first allocates space for something called
TSEG (something to do with SMM) from the top of memory, and then
it allocates the graphics stolen memory below that. According to
the chipset documentation TSEG has a fixed size of 1MB on 855.
So that explains the top 1MB in the e820 region. And it also
confirms that the GTT entries are in fact at the end of the
stolen memory region.
Derive the stolen memory base address on gen2 the same as the
BIOS does (TOM-TSEG_SIZE-stolen_size). There are a few
differences between the registers on various gen2 chipsets, so a
few different codepaths are required.
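The formula can be checked against the e820 map above. In the sketch below toy_stolen_base() is a hypothetical helper, and the top of memory is assumed to be 512MB, inferred from the reserved region ending at 0x1fffffff rather than read from a chipset register.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MB (1024u * 1024u)

/* base = TOM - TSEG_SIZE - stolen_size, all in bytes. */
static uint32_t toy_stolen_base(uint32_t tom, uint32_t tseg_size,
				uint32_t stolen_size)
{
	assert(tom >= tseg_size && tom - tseg_size >= stolen_size);
	return tom - tseg_size - stolen_size;
}
```

toy_stolen_base(512 * TOY_MB, 1 * TOY_MB, 8 * TOY_MB) gives 0x1f700000, the start of the 9MB reserved region.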
865G is again a bit more special since it seems to support enough
memory to hit 4GB address space issues. This means the PCI
allocations will also affect the location of the stolen memory.
Fortunately there appears to be the TOUD register which may give
us the correct answer directly. But the chipset docs are a bit
unclear, so I'm not 100% sure that the graphics stolen memory is
always the last thing the BIOS steals. Someone would need to
verify it on a real system.
I tested this on my 830 and 855 machines, and so far
everything looks peachy.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1391628540-23072-3-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A bunch of unknown NMIs have popped up on a Pentium4 recently when booting
into a kdump kernel. This was exposed because the watchdog timer went
from 60 seconds down to 10 seconds (increasing the ability to reproduce
this problem).
What is happening is on boot up of the second kernel (the kdump one),
the previous nmi_watchdogs were enabled on thread 0 and thread 1. The
second kernel only initializes one cpu but the perf counter on thread 1
still counts.
Normally in a kdump scenario, the other cpus are blocking in an NMI loop,
but more importantly their local apics have the performance counters disabled
(iow LVTPC is masked). So any counters that fire are masked and never get
through to the second kernel.
However, on a P4 the local apic is shared by both threads and thread1's PMI
(despite being configured to only interrupt thread1) will generate an NMI on
thread0. Because thread0 knows nothing about this NMI, it is seen as an
unknown NMI.
This would be fine: it is a kdump kernel, strange things happen, so
what is the big deal about a single unknown NMI?
Unfortunately, the P4 comes with another quirk: the overflow bit must be cleared
to prevent a stream of NMIs. This is the problem.
The kdump kernel can not execute because of the endless NMIs that happen.
To solve this, I instrumented the p4 perf init code, to walk all the counters
and zero them out (just like a normal reset would).
Now when the counters go off, they do not generate anything and no unknown
NMIs are seen.
I tested this on a P4 we have in our lab. After two or three crashes, I could
normally reproduce the problem. Now after 10 crashes, everything continues
to boot correctly.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140120154115.GZ25953@redhat.com
[ Fixed a stylistic detail. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On a P4 box stressing perf with:
./perf record -o perf.data ./perf stat -v ./perf bench all
it was noticed that a slew of unknown NMIs would pop out rather quickly.
Painfully debugging this ancient platform led me to notice cross cpu counter
corruption.
The P4 machine is special in that it has 18 counters, half are used for cpu0
and the other half is for cpu1 (or all 18 if hyperthreading is disabled). But
the splitting of the counters has to be actively managed by the software.
In this particular bug, one of the cpu0 specific counters was being used by
cpu1 and caused all sorts of random unknown nmis.
I am not entirely sure on the corruption path, but what happens is:
o perf schedules a group with p4_pmu_schedule_events()
o inside p4_pmu_schedule_events(), it notices an hwc pointer is being reused
but for a different cpu, so it 'swaps' the config bits and returns the
updated 'assign' array with a _new_ index.
o perf schedules another group with p4_pmu_schedule_events()
o inside p4_pmu_schedule_events(), it notices an hwc pointer is being reused
(the same one as above) but for the _same_ cpu [BUG!!], so it updates the
'assign' array to use the _old_ (wrong cpu) index because the _new_ index is in
an earlier part of the 'assign' array (and hasn't been committed yet).
o perf commits the transaction using the wrong index and corrupts the other cpu
The [BUG!!] is because the 'hwc->config' is updated but not the 'hwc->idx'. So
the check for 'p4_should_swap_ts()' is correct the first time around but
incorrect the second time around (because hwc->config was updated in between).
I think the spirit of perf was to not modify anything until all the
transactions had a chance to 'test' if they would succeed, and if so, commit
atomically. However, P4 breaks this spirit by touching the hwc->config
element.
So my fix is to continue the un-perf like breakage, by assigning hwc->idx to -1
on swap to tell follow up group scheduling to find a new index.
Of course if the transaction fails rolling this back will be difficult, but
that is no different from how the current code works. :-) And I wasn't sure
how much effort I should put into cleaning up the code for a platform that is
almost 10 years old by now.
Hence the lazy fix.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391024270-19469-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When debug preempt is enabled, preempt_disable() can be traced by
function and function graph tracing.
There's a place in the function graph tracer that calls trace_clock()
which eventually calls cycles_2_ns() outside of the recursion
protection. When cycles_2_ns() calls preempt_disable() it gets traced
and the graph tracer will go into a recursive loop causing a crash or
worse, a triple fault.
Simple fix is to use preempt_disable_notrace() in cycles_2_ns, which
makes sense because the preempt_disable() tracing may use that code
too, and tracing it, even with recursion protection, is rather
pointless.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140204141315.2a968a72@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>