We have 2 functions using the same sched_task callback:
- PEBS drain for free running counters
- LBR save/store
Both of them are called from intel_pmu_sched_task() and
either of them can be unwillingly triggered when the
other one is configured to run.
Let's say there's PEBS drain configured in sched_task
callback for the event, but in the callback itself
(intel_pmu_sched_task()) we will also run the code for
LBR save/restore, which we did not ask for, but the
code in intel_pmu_sched_task() does not check for that.
This can lead to extra cycles in some perf monitoring,
like when we monitor PEBS event without LBR data.
# perf record --no-timestamp -c 10000 -e cycles:p ./perf bench sched pipe -l 1000000
(We need PEBS, non freq/non timestamp event to enable
the sched_task callback)
The perf stat of cycles and msr:write_msr for above
command before the change:
...
Performance counter stats for './perf record --no-timestamp -c 10000 -e cycles:p \
./perf bench sched pipe -l 1000000' (5 runs):
18,519,557,441 cycles:k
91,195,527 msr:write_msr
29.334476406 seconds time elapsed
And after the change:
...
Performance counter stats for './perf record --no-timestamp -c 10000 -e cycles:p \
./perf bench sched pipe -l 1000000' (5 runs):
18,704,973,540 cycles:k
27,184,720 msr:write_msr
16.977875900 seconds time elapsed
There's no affect on cycles:k because the sched_task happens
with events switched off, however the msr:write_msr tracepoint
counter together with almost 50% of time speedup show the
improvement.
Monitoring LBR event and having extra PEBS drain processing
in sched_task callback showed just a little speedup, because
the drain function does not do much extra work in case there
is no PEBS data.
Adding conditions to recognize the configured work that needs
to be done in the x86_pmu's sched_task callback.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20170719075247.GA27506@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The lbr_context logic confused me; it appears to me to try and do the
same thing the pmu::sched_task() callback does now, but limited to
per-task events.
So rip it out. Afaict this should also improve performance, because I
think the current code can end up doing lbr_reset() twice, once from
the pmu::add() and then again from pmu::sched_task(), and MSR writes
(all 3*16 of them) are expensive!!
While thinking through the cases that need the reset it occured to me
the first install of an event in an active context needs to reset the
LBR (who knows what crap is in there), but detecting this case is
somewhat hard.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add quirk for context switch to save/restore the value of
MSR_LAST_BRANCH_FROM_x when LBR is enabled and there is potential for
kernel addresses to be in the lbr_from register.
To test this patch, use a perf tool and kernel with the patch
next in this series. That patch removes the work around that masked
the hw bug:
$ ./lbr_perf record --call-graph lbr -e cycles:k sleep 1
where lbr_perf is the patched perf tool, that allows to specify :k
on lbr mode. The above command will trigger a #GPF :
WARNING: CPU: 28 PID: 14096 at arch/x86/mm/extable.c:65 ex_handler_wrmsr_unsafe+0x70/0x80
unchecked MSR access error: WRMSR to 0x681 (tried to write 0x1fffffff81010794)
...
Call Trace:
[<ffffffff8167af49>] dump_stack+0x4d/0x63
[<ffffffff810b9b15>] __warn+0xe5/0x100
[<ffffffff810b9be9>] warn_slowpath_fmt+0x49/0x50
[<ffffffff810abb40>] ex_handler_wrmsr_unsafe+0x70/0x80
[<ffffffff810abc42>] fixup_exception+0x42/0x50
[<ffffffff81079d1a>] do_general_protection+0x8a/0x160
[<ffffffff81684ec2>] general_protection+0x22/0x30
[<ffffffff810101b9>] ? intel_pmu_lbr_sched_task+0xc9/0x380
[<ffffffff81009d7c>] intel_pmu_sched_task+0x3c/0x60
[<ffffffff81003a2b>] x86_pmu_sched_task+0x1b/0x20
[<ffffffff81192a5b>] perf_pmu_sched_task+0x6b/0xb0
[<ffffffff8119746d>] __perf_event_task_sched_in+0x7d/0x150
[<ffffffff810dd9dc>] finish_task_switch+0x15c/0x200
[<ffffffff8167f894>] __schedule+0x274/0x6cc
[<ffffffff8167fdd9>] schedule+0x39/0x90
[<ffffffff81675398>] exit_to_usermode_loop+0x39/0x89
[<ffffffff810028ce>] prepare_exit_to_usermode+0x2e/0x30
[<ffffffff81683c1b>] retint_user+0x8/0x10
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1466533874-52003-5-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel's SDM states that bits 61:62 in MSR_LAST_BRANCH_FROM_x are the
TSX flags for formats with LBR_TSX flags (i.e. LBR_FORMAT_EIP_EFLAGS2).
However, when the CPU has TSX support deactivated, bits 61:62 actually
behave as follows:
- For wrmsr(), bits 61:62 are considered part of the sign extension.
- When capturing branches, the LBR hw will always clear bits 61:62.
regardless of the sign extension.
Therefore, if:
1) LBR has TSX format.
2) CPU has no TSX support enabled.
... then any value passed to wrmsr() must be sign extended to 63 bits
and any value from rdmsr() must be converted to have a sign extension
of 61 bits, ignoring the values at TSX flags.
This bug was masked by the work-around to the Intel's CPU bug:
BJ94. "LBR May Contain Incorrect Information When Using FREEZE_LBRS_ON_PMI"
in Document Number: 324643-037US.
The aforementioned work-around uses hw flags to filter out all kernel
branches, limiting LBR callstack to user level execution only.
Since user addresses are not sign extended, they do not trigger the wrmsr()
bug in MSR_LAST_BRANCH_FROM_x when saved/restored at context switch.
To verify the hw bug:
$ perf record -b -e cycles sleep 1
$ rdmsr -p 0 0x680
0x1fffffffb0b9b0cc
$ wrmsr -p 0 0x680 0x1fffffffb0b9b0cc
write(): Input/output error
The quirk for LBR_FROM_ MSRs is required before calls to wrmsrl() and
after rdmsrl().
This patch introduces it for wrmsrl()'s done for testing LBR support.
Future patch in series adds the quirk for context switch, that would
be required if LBR callstack is to be enabled for ring 0.
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1466533874-52003-3-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>