When testing stmmac with my QoS reference design I checked a problem in the
CSR clock configuration that was impossibilitating the phy discovery, since
every read operation returned 0x0000ffff. This patch fixes the issue.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idle work handler is self-arming - if it detects that it needs to
run again it will queue itself from its work handler. Take greater care
when trying to drain the idle work, and double check that it is flushed.
The free worker has a similar issue where it is armed by an RCU task
which may be running concurrently with us.
This should hopefully help with the sporadic WARN_ON(dev_priv->gt.awake)
from i915_gem_suspend.
v2: Reuse drain_freed_objects.
v3: Don't try to flush the freed objects from the shrinker, as it may be
underneath the struct_mutex already.
v4: do while and comment upon the excess rcu_barrier in drain_freed_objects
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161223145804.6605-2-chris@chris-wilson.co.uk
Now when adding an ipt_CLUSTERIP rule, it only checks duplicate config in
clusterip_config_find_get(). But after that, there may be still another
thread to insert a config with the same ip, then it leaves proc_create_data
to do duplicate check.
It's more reasonable to check duplicate config by ipt_CLUSTERIP itself,
instead of checking it by proc fs duplicate file check. Before, when proc
fs allowed duplicate name files in a directory, It could even crash kernel
because of use-after-free.
This patch is to check duplicate config under the protection of clusterip
net lock when initializing a new config and correct the return err.
Note that it also moves proc file node creation after adding new config, as
proc_create_data may sleep, it couldn't be called under the clusterip_net
lock. clusterip_config_find_get returns NULL if c->pde is null to make sure
it can't be used until the proc file node creation is done.
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those. Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sparse says:
fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static?
Note that the forward declaration in the file is already marked static.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If you have a process that has set itself to be non-dumpable, and it
then undergoes exec(2), any CLOEXEC file descriptors it has open are
"exposed" during a race window between the dumpable flags of the process
being reset for exec(2) and CLOEXEC being applied to the file
descriptors. This can be exploited by a process by attempting to access
/proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
The race in question is after set_dumpable has been (for get_link,
though the trace is basically the same for readlink):
[vfs]
-> proc_pid_link_inode_operations.get_link
-> proc_pid_get_link
-> proc_fd_access_allowed
-> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
Which will return 0, during the race window and CLOEXEC file descriptors
will still be open during this window because do_close_on_exec has not
been called yet. As a result, the ordering of these calls should be
reversed to avoid this race window.
This is of particular concern to container runtimes, where joining a
PID namespace with file descriptors referring to the host filesystem
can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
against access of CLOEXEC file descriptors -- file descriptors which may
reference filesystem objects the container shouldn't have access to).
Cc: dev@opencontainers.org
Cc: <stable@vger.kernel.org> # v3.2+
Reported-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If kernfs file is empty on a first read, successive read operations
using the same file descriptor will return no data, even when data is
available. Default kernfs 'seq_next' implementation advances iterator
position even when next object is not there. Kernfs 'seq_start' for
following requests will not return iterator as position is already on
the second object.
This defect doesn't allow to monitor badblocks sysfs files from MD raid.
They are initially empty but if data appears at some stage, userspace is
not able to read it.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Strengthen the checking of pos/len vs. i_size, clarify the return values
for the clone prep function, and remove pointless code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Problem similar to ones dealt with in "fold checks into iterate_and_advance()"
and followups, except that in this case we really want to do nothing when
asked for zero-length operation - unlike zero-length iterate_and_advance(),
zero-length iterate_all_kinds() has no side effects, and callers are simpler
that way.
That got exposed when copy_from_iter_full() had been used by tipc, which
builds an msghdr with zero payload and (now) feeds it to a primitive
based on iterate_all_kinds() instead of iterate_and_advance().
Reported-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and fix the minor buglet in compat io_submit() - native one
kills ioctx as cleanup when put_user() fails. Get rid of
bogus compat_... in !CONFIG_AIO case, while we are at it - they
should simply fail with ENOSYS, same as for native counterparts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When --time option is given with a value outside recorded time, the last
sample time (tprev) was set to that value and run time calculation might
be incorrect. This is a problem of the first samples for each cpus
since it would skip the runtime update when tprev is 0. But with --time
option it had non-zero (which is invalid) value so the calculation is
also incorrect.
For example, let's see the followging:
$ perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
3195.968367 [0003] <idle> 0.000 0.000 0.000
3195.968386 [0002] Timer[4306/4277] 0.000 0.000 0.018
3195.968397 [0002] Web Content[4277] 0.000 0.000 0.000
3195.968595 [0001] JS Helper[4302/4277] 0.000 0.000 0.000
3195.969217 [0000] <idle> 0.000 0.000 0.621
3195.969251 [0001] kworker/1:1H[291] 0.000 0.000 0.033
The sample starts at 3195.968367 but when I gave a time interval from
3194 to 3196 (in sec) it will calculate the whole 2 second as runtime.
In below, 2 cpus accounted it as runtime, other 2 cpus accounted it as
idle time.
Before:
$ perf sched timehist --time 3194,3196 -s | tail
Idle stats:
CPU 0 idle for 1995.991 msec
CPU 1 idle for 20.793 msec
CPU 2 idle for 30.191 msec
CPU 3 idle for 1999.852 msec
Total number of unique tasks: 23
Total number of context switches: 128
Total run time (msec): 3724.940
After:
$ perf sched timehist --time 3194,3196 -s | tail
Idle stats:
CPU 0 idle for 10.811 msec
CPU 1 idle for 20.793 msec
CPU 2 idle for 30.191 msec
CPU 3 idle for 18.337 msec
Total number of unique tasks: 23
Total number of context switches: 128
Total run time (msec): 18.139
Committer notes:
Further testing:
Before:
Idle stats:
CPU 0 idle for 229.785 msec
CPU 1 idle for 937.944 msec
CPU 2 idle for 188.931 msec
CPU 3 idle for 986.185 msec
After:
# perf sched timehist --time 40602,40603 -s | tail
Idle stats:
CPU 0 idle for 229.785 msec
CPU 1 idle for 175.407 msec
CPU 2 idle for 188.931 msec
CPU 3 idle for 223.657 msec
Total number of unique tasks: 68
Total number of context switches: 814
Total run time (msec): 97.688
# for cpu in `seq 0 3` ; do echo -n "CPU $cpu idle for " ; perf sched timehist --time 40602,40603 | grep "\[000${cpu}\].*\<idle\>" | tr -s ' ' | cut -d' ' -f7 | awk '{entries++ ; s+=$1} END {print s " msec (entries: " entries ")"}' ; done
CPU 0 idle for 229.721 msec (entries: 123)
CPU 1 idle for 175.381 msec (entries: 65)
CPU 2 idle for 188.903 msec (entries: 56)
CPU 3 idle for 223.61 msec (entries: 102)
Difference due to the idle stats being accounted at nanoseconds precision while
the <idle> entries in 'perf sched timehist' are trucated at msec.usec.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixes: 853b740711 ("perf sched timehist: Add option to specify time window of interest")
Link: http://lkml.kernel.org/r/20161222060350.17655-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Current default value is 20 but it's easily changed to a bigger value as
task has a long name and different tid and pid. And it makes the output
not aligned. So change it to have a large value as summary shows.
Committer notes:
Before:
# perf sched record
^C
# perf sched timehist
<SNIP>
40602.770537 [0001] rcuos/2[29] 7.970 0.002 0.020
40602.771512 [0003] <idle> 0.003 0.000 0.986
40602.771586 [0001] <idle> 0.020 0.000 1.049
40602.771606 [0001] qemu-system-x86[3593/3510] 0.000 0.002 0.020
40602.771629 [0003] qemu-system-x86[3510] 0.000 0.003 0.116
40602.771776 [0000] <idle> 0.001 0.000 1.892
<SNIP>
After:
# perf sched timehist
<SNIP>
40602.770537 [0001] rcuos/2[29] 7.970 0.002 0.020
40602.771512 [0003] <idle> 0.003 0.000 0.986
40602.771586 [0001] <idle> 0.020 0.000 1.049
40602.771606 [0001] qemu-system-x86[3593/3510] 0.000 0.002 0.020
40602.771629 [0003] qemu-system-x86[3510] 0.000 0.003 0.116
<SNIP>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org
[ Split from a larger patch ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
BSpec says:
"Overlay Clock Gating Must be Disabled: Overlay & L2 Cache clock gating
must be disabled in order to prevent device hangs when turning off overlay.SW
must turn off Ovrunit clock gating (6200h) and L2 Cache clock gating (C8h)."
We only turned off the overlay clock gating (due to lack of docs I
presume). After a bit of experimentation it looks like the the magic
C8h register lives in the PCI config space of device 0, and the magic
bit appears to be bit 2. Or at the very least this eliminates the GPU
death after MI_OVERLAY_OFF.
L2 clock gating seems to save ~80mW, so let's keep it on unless we need
to actually use the overlay.
Also let's move the OVRUNIT clock gating to the same place since we can,
and 845 supposedly doesn't need it.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1481131693-27993-11-git-send-email-ville.syrjala@linux.intel.com
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Do the overlay frontbuffer tracking properly so that it matches
the state of the overlay on/off/continue requests.
One slight problem is that intel_frontbuffer_flip_complete()
may get delayed by an arbitrarily liong time due to the fact that
the overlay code likes to bail out when a signal occurs. So the
flip may not get completed until the ioctl is restarted. But fixing
that would require bigger surgery, so I decided to ignore it for now.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1481131693-27993-5-git-send-email-ville.syrjala@linux.intel.com
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
First set of i915 fixes for code in next.
* tag 'drm-intel-next-fixes-2016-12-22' of git://anongit.freedesktop.org/git/drm-intel:
drm/i915: skip the first 4k of stolen memory on everything >= gen8
drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
drm/i915: Fix use after free in logical_render_ring_init
drm/i915: disable PSR by default on HSW/BDW
drm/i915: Fix setting of boost freq tunable
drm/i915: tune down the fast link training vs boot fail
drm/i915: Reorder phys backing storage release
drm/i915/gen9: Fix PCODE polling during SAGV disabling
drm/i915/gen9: Fix PCODE polling during CDCLK change notification
drm/i915/dsi: Fix chv_exec_gpio disabling the GPIOs it is setting
drm/i915/dsi: Fix swapping of MIPI_SEQ_DEASSERT_RESET / MIPI_SEQ_ASSERT_RESET
drm/i915/dsi: Do not clear DPOUNIT_CLOCK_GATE_DISABLE from vlv_init_display_clock_gating
drm/i915: drop the struct_mutex when wedged or trying to reset
Here's the one lonely bugfix I talked about on irc.
* tag 'drm-misc-fixes-2016-12-22' of git://anongit.freedesktop.org/git/drm-misc:
drivers/gpu/drm/ast: Fix infinite loop if read fails
- fix display regression on DCE6/8
- Powergating fixes for GFX8
- amdgpu SI fixes (golden settings, proper rev id setup, etc.)
* 'drm-next-4.10' of git://people.freedesktop.org/~agd5f/linux: (21 commits)
drm/amdgpu: update tile table for oland/hainan
drm/amdgpu: update tile table for verde
drm/amdgpu: update rev id for verde
drm/amdgpu: update golden setting for verde
drm/amdgpu: update rev id for oland
drm/amdgpu: update golden setting for oland
drm/amdgpu: update rev id for hainan
drm/amdgpu: update golden setting for hainan
drm/amdgpu: update rev id for pitcairn
drm/amdgpu: update golden setting for pitcairn
drm/amdgpu: update golden setting/tiling table of tahiti
drm/amdgpu: fix cursor setting of dce6/dce8
drm/amdgpu: refine set clock gating for tonga/polaris
drm/amdgpu: initialize cg flags for tonga/polaris10/polaris11.
drm/amdgpu: add new gfx cg flags.
drm/amdgpu: fix pg can't be disabled by PG mask.
drm/amdgpu: always initialize gfx pg for gfx_v8.0.
drm/amdgpu: enable AMD_PG_SUPPORT_CP in Carrizo/Stoney.
drm/amdgpu: fix init save/restore list in gfx_v8.0
drm/amdgpu: fix enable_cp_power_gating in gfx_v8.0.
...
Christoph writes:
The most significant one is that we've agreed on shared maintaince and
a common repository for the PCIe NVMe driver and NVMe over Fabrics. The
target code still only has a subset of the maintainers but goes through
the same tree as well. Keith, Sagi and me will take turns at collecting
patches and sending you pull requests.
blk_scsi_cmd_filter use was deprecated by 4beab5c6 and the SCSI macros
are duplicated in blkdev.h, both likely reintroduced by a bad merge from
540eed56.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This allows sending larger than 1 MB requests to devices that support
large I/O sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull LED maintainer email update from Jacek Anaszewski:
"Update Jacek Anaszewski's email address"
* tag 'leds_for_4.10_email_update' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
MAINTAINERS: Update Jacek Anaszewski's email address
Pull block layer fixes from Jens Axboe:
"Just a set of small fixes that have either been queued up after the
original pull for this merge window, or just missed the original pull
request.
- a few bcache fixes/changes from Eric and Kent
- add WRITE_SAME to the command filter whitelist frm Mauricio
- kill an unused struct member from Ritesh
- partition IO alignment fix from Stefan
- nvme sysfs printf fix from Stephen"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: check partition alignment
nvme : Use correct scnprintf in cmb show
block: allow WRITE_SAME commands with the SG_IO ioctl
block: Remove unused member (busy) from struct blk_queue_tag
bcache: partition support: add 16 minors per bcacheN device
bcache: Make gc wakeup sane, remove set_task_state()
Pull more ACPI updates from Rafael Wysocki:
"Here are new versions of two ACPICA changes that were deferred
previously due to a problem they had introduced, two cleanups on top
of them and the removal of a useless warning message from the ACPI
core.
Specifics:
- Move some Linux-specific functionality to upstream ACPICA and
update the in-kernel users of it accordingly (Lv Zheng)
- Drop a useless warning (triggered by the lack of an optional
object) from the ACPI namespace scanning code (Zhang Rui)"
* tag 'acpi-extra-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / osl: Remove deprecated acpi_get_table_with_size()/early_acpi_os_unmap_memory()
ACPI / osl: Remove acpi_get_table_with_size()/early_acpi_os_unmap_memory() users
ACPICA: Tables: Allow FADT to be customized with virtual address
ACPICA: Tables: Back port acpi_get_table_with_size() and early_acpi_os_unmap_memory() from Linux kernel
ACPI: do not warn if _BQC does not exist
Pull power management fixes from Rafael Wysocki:
"They fix one bug introduced recently, a build warning and a kerneldoc
function description.
Specifics:
- Prevent the acpi-cpufreq driver from crashing on exit by fixing a
check against the __cpuhp_setup_state() return value and fix the
kerneldoc description of that function to make it clear that it may
return positive numbers on success too (Boris Ostrovsky)
- Drop an incorrect __init annotation of a function in the s3c64xx
cpufreq driver and fix a build warning generated (by older
compilers) because of it (Arnd Bergmann)"
* tag 'pm-fixes-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: s3c64xx: remove incorrect __init annotation
cpufreq: Remove CPU hotplug callbacks only if they were initialized
CPU/hotplug: Clarify description of __cpuhp_setup_state() return value
Pull MMC fixes from Ulf Hansson:
"MMC core:
- further fix thread wake-up for requests
- use a bounce buffer to fix DMA issue for SSR register read
MMC host:
- sdhci: Fix a regression for runtime PM
- sdhci-cadence: Add a proper SoC specific DT compatible"
* tag 'mmc-v4.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sd: Meet alignment requirements for raw_ssr DMA
mmc: core: Further fix thread wake-up
mmc: sdhci: Fix to handle MMC_POWER_UNDEFINED
mmc: sdhci-cadence: add Socionext UniPhier specific compatible string
Pull SElinux fix from James Morris:
"From Paul:
'A small SELinux patch to fix some clang/llvm compiler warnings and
ensure the tools under scripts work well in the face of kernel
changes'"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
selinux: use the kernel headers when building scripts/selinux
Pull x86 cache allocation interface from Thomas Gleixner:
"This provides support for Intel's Cache Allocation Technology, a cache
partitioning mechanism.
The interface is odd, but the hardware interface of that CAT stuff is
odd as well.
We tried hard to come up with an abstraction, but that only allows
rather simple partitioning, but no way of sharing and dealing with the
per package nature of this mechanism.
In the end we decided to expose the allocation bitmaps directly so all
combinations of the hardware can be utilized.
There are two ways of associating a cache partition:
- Task
A task can be added to a resource group. It uses the cache
partition associated to the group.
- CPU
All tasks which are not member of a resource group use the group to
which the CPU they are running on is associated with.
That allows for simple CPU based partitioning schemes.
The main expected user sare:
- Virtualization so a VM can only trash only the associated part of
the cash w/o disturbing others
- Real-Time systems to seperate RT and general workloads.
- Latency sensitive enterprise workloads
- In theory this also can be used to protect against cache side
channel attacks"
[ Intel RDT is "Resource Director Technology". The interface really is
rather odd and very specific, which delayed this pull request while I
was thinking about it. The pull request itself came in early during
the merge window, I just delayed it until things had calmed down and I
had more time.
But people tell me they'll use this, and the good news is that it is
_so_ specific that it's rather independent of anything else, and no
user is going to depend on the interface since it's pretty rare. So if
push comes to shove, we can just remove the interface and nothing will
break ]
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/intel_rdt: Implement show_options() for resctrlfs
x86/intel_rdt: Call intel_rdt_sched_in() with preemption disabled
x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount
x86/intel_rdt: Fix setting of closid when adding CPUs to a group
x86/intel_rdt: Update percpu closid immeditately on CPUs affected by changee
x86/intel_rdt: Reset per cpu closids on unmount
x86/intel_rdt: Select KERNFS when enabling INTEL_RDT_A
x86/intel_rdt: Prevent deadlock against hotplug lock
x86/intel_rdt: Protect info directory from removal
x86/intel_rdt: Add info files to Documentation
x86/intel_rdt: Export the minimum number of set mask bits in sysfs
x86/intel_rdt: Propagate error in rdt_mount() properly
x86/intel_rdt: Add a missing #include
MAINTAINERS: Add maintainer for Intel RDT resource allocation
x86/intel_rdt: Add scheduler hook
x86/intel_rdt: Add schemata file
x86/intel_rdt: Add tasks files
x86/intel_rdt: Add cpus file
x86/intel_rdt: Add mkdir to resctrl file system
x86/intel_rdt: Add "info" files to resctrl file system
...
Jiri reported the overlap scheduling exceeding its max stack.
Looking at the constraint that triggered this, it turns out the
overlap marker isn't needed.
The comment with EVENT_CONSTRAINT_OVERLAP states: "This is the case if
the counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight".
Esp. that latter part is of interest here I think, our overlapping mask
is 0x0e, that has 3 bits set and is the highest weight mask in on the
PMU, therefore it will be placed last. Can we still create a scenario
where we would need to rewind that?
The scenario for AMD Fam15h is we're having masks like:
0x3F -- 111111
0x38 -- 111000
0x07 -- 000111
0x09 -- 001001
And we mark 0x09 as overlapping, because it is not a direct subset of
0x38 or 0x07 and has less weight than either of those. This means we'll
first try and place the 0x09 event, then try and place 0x38/0x07 events.
Now imagine we have:
3 * 0x07 + 0x09
and the initial pick for the 0x09 event is counter 0, then we'll fail to
place all 0x07 events. So we'll pop back, try counter 4 for the 0x09
event, and then re-try all 0x07 events, which will now work.
The masks on the PMU in question are:
0x01 - 0001
0x03 - 0011
0x0e - 1110
0x0c - 1100
But since all the masks that have overlap (0xe -> {0xc,0x3}) and (0x3 ->
0x1) are of heavier weight, it should all work out.
Reported-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20161109155153.GQ3142@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch solves a race condition between PEBS and the PMU handler.
In case multiple PEBS events are sampled at the same time,
it is possible to have GLOBAL_STATUS bit 62 set indicating
PEBS buffer overflow and also seeing at most 3 PEBS counters
having their bits set in the status register. This is a sign
that there was at least one PEBS record pending at the time
of the PMU interrupt. PEBS counters must only be processed
via the drain_pebs() calls, and not via the regular sample
processing loop coming after that the function, otherwise
phony regular samples may be generated in the sampling buffer
not marked with the EXACT tag.
Another possibility is to have one PEBS event and at least
one non-PEBS event whic hoverflows while PEBS has armed. In this
case, bit 62 of GLOBAL_STATUS will not be set, yet the overflow
status bit for the PEBS counter will be on Skylake.
To avoid this problem, we systematically ignore the PEBS-enabled
counters from the GLOBAL_STATUS mask and we always process PEBS
events via drain_pebs().
The problem manifested itself by having non-exact samples when
sampling only PEBS events, i.e., the PERF_SAMPLE_RECORD would
not have the EXACT flag set.
Note that this problem is only present on Skylake processor.
This fix is harmless on older processors.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482395366-8992-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>