It is wrong to block the user thread in the next poll when OA data is
already available which could not fit in the user buffer provided in
the previous read. In several cases the exact user buffer size is not
known. Blocking user space in poll can lead to data loss when the
buffer size used is smaller than the available data.
This change fixes this issue and allows user space to read all OA data
even when using a buffer size smaller than the available data using
multiple non-blocking reads rather than staying blocked in poll till
the next timer interrupt.
v2: Fix ret value for blocking reads (Umesh)
v3: Mistake during patch send (Ashutosh)
v4: Remove -EAGAIN from comment (Umesh)
v5: Improve condition for clearing pollin and return (Lionel)
v6: Improve blocking read loop and other cleanups (Lionel)
v7: Added Cc stable
Testcase: igt/perf/polling-small-buf
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200403010120.3067-1-ashutosh.dixit@intel.com
(cherry-picked from commit 6352219c39)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Inside the general i915_oa_init_reg_state() we avoid using the
perf->mutex. However, we rely on perf->exclusive_stream being valid to
access at that point, and for that we have to control the race with
disabling perf. This relies on the disabling being a heavy barrier that
inspects all active contexts, after marking the perf->exclusive_stream
as not available. This should ensure that there are no more concurrent
accesses to the perf->exclusive_stream as we destroy it.
Mark up the races around the perf->exclusive_stream so that they stand
out much more. (And hopefully we will be running kcsan to start
validating that the only races we have are carefully controlled.)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200227085723.1961649-2-chris@chris-wilson.co.uk
Converts various instances of the printk drm logging macros to the
struct drm_device based logging macros in the drm/i915 folder using the
following coccinelle script that transforms based on the existence of
the struct drm_i915_private device pointer:
@@
identifier fn, T;
@@
fn(...) {
...
struct drm_i915_private *T = ...;
<+...
(
-DRM_INFO(
+drm_info(&T->drm,
...)
|
-DRM_ERROR(
+drm_err(&T->drm,
...)
|
-DRM_WARN(
+drm_warn(&T->drm,
...)
|
-DRM_DEBUG_KMS(
+drm_dbg_kms(&T->drm,
...)
|
-DRM_DEBUG_DRIVER(
+drm_dbg(&T->drm,
...)
|
-DRM_DEBUG_ATOMIC(
+drm_dbg_atomic(&T->drm,
...)
)
...+>
}
@@
identifier fn, T;
@@
fn(...,struct drm_i915_private *T,...) {
<+...
(
-DRM_INFO(
+drm_info(&T->drm,
...)
|
-DRM_ERROR(
+drm_err(&T->drm,
...)
|
-DRM_WARN(
+drm_warn(&T->drm,
...)
|
-DRM_DEBUG_DRIVER(
+drm_dbg(&T->drm,
...)
|
-DRM_DEBUG_KMS(
+drm_dbg_kms(&T->drm,
...)
|
-DRM_DEBUG_ATOMIC(
+drm_dbg_atomic(&T->drm,
...)
)
...+>
}
Checkpatch warnings were fixed manually.
Instances of the DRM_DEBUG macro were not converted due to lack of a
consensus of an analogous struct drm_device based macro.
References: https://lists.freedesktop.org/archives/dri-devel/2020-January/253381.html
Signed-off-by: Wambui Karuga <wambui.karugax@gmail.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200131093416.28431-2-wambui.karugax@gmail.com
Allocate only an internal intel_context for the kernel_context, forgoing
a global GEM context for internal use as we only require a separate
address space (for our own protection).
Now having weaned GT from requiring ce->gem_context, we can stop
referencing it entirely. This also means we no longer have to create random
and unnecessary GEM contexts for internal use.
GEM contexts are now entirely for tracking GEM clients, and intel_context
the execution environment on the GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andi Shyti <andi.shyti@intel.com>
Acked-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191221160324.1073045-1-chris@chris-wilson.co.uk
As the engine->kernel_context is used within the engine-pm barrier, we
have to be careful when emitting requests outside of the barrier, as the
strict timeline locking rules do not apply. Instead, we must ensure the
engine_park() cannot be entered as we build the request, which is
simplest by taking an explicit engine-pm wakeref around the request
construction.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-1-chris@chris-wilson.co.uk
The design of the OA unit has been split into several units. We now
have a global unit (OAG) and a render specific unit (OAR). This leads
to some changes on how we program things. Some details :
OAR:
- has its own set of counter registers, they are per-context
saved/restored
- counters are not written to the circular OA buffer
- a snapshot of the counters can be acquired with
MI_RECORD_PERF_COUNT, or a single counter can be read with
MI_STORE_REGISTER_MEM.
OAG:
- has global counters that increment across context switches
- counters are written into the circular OA buffer (if requested)
v2: Fix checkpatch warnings on code style (Lucas)
v3: (Umesh)
- Update register from which tail, status and head are read
- Update logic to sample context reports
- Update whitelist mux and b counter regs
v4: Fix a bug when updating context image for new contexts (Umesh)
v5: Squash patch enabling save/restore of counters into context image
We want this so we can preempt performance queries and keep the
system responsive even when long running queries are ongoing. We
avoid doing it for all contexts.
- use LRI to modify context control (Chris)
- use MASKED_FIELD to program just the masked bits (Chris)
- disable save/restore of counters on cleanup (Chris)
v6: Do not use implicit parameters (Chris)
BSpec: 28727, 30021
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Chris Wilson <chris.p.wilson@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191025193746.47155-2-umesh.nerlige.ramappa@intel.com
The actual conditions are that we know the GPU is not accessing the
context, and we hold a pin on the context image to allow CPU access. We
used a fake lock on ce->pin_mutex so that we could try and use lockdep
to assert that access is serialised, but the various different
hardirq/softirq contexts where we need to *fake* holding the pin_mutex
are causing more trouble.
Still it would be nice if we did have a way to reassure ourselves that
the direct update to the context image is serialised with GPU execution.
In the meantime, stop lockdep complaining about false irq inversions.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111923
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191022122845.25038-1-chris@chris-wilson.co.uk
We would like to make use of perf in Vulkan. The Vulkan API is much
lower level than OpenGL, with applications directly exposed to the
concept of command buffers (pretty much equivalent to our batch
buffers). In Vulkan, queries are always limited in scope to a command
buffer. In OpenGL, the lack of command buffer concept meant that
queries' duration could span multiple command buffers.
With that restriction gone in Vulkan, we would like to simplify
measuring performance just by measuring the deltas between the counter
snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the
more complex scheme we currently have in the GL driver, using 2
MI_RECORD_PERF_COUNT commands and doing some post processing on the
stream of OA reports, coming from the global OA buffer, to remove any
unrelated deltas in between the 2 MI_RECORD_PERF_COUNT.
Disabling preemption only apply to a single context with which want to
query performance counters for and is considered a privileged
operation, by default protected by CAP_SYS_ADMIN. It is possible to
enable it for a normal user by disabling the paranoid stream setting.
v2: Store preemption setting in intel_context (Chris)
v3: Use priorities to avoid preemption rather than the HW mechanism
v4: Just modify the port priority reporting function
v5: Add nopreempt flag on gem context and always flag requests
appropriately, regarless of OA reconfiguration.
Link: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/932
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191014201404.22468-4-chris@chris-wilson.co.uk
Listing configurations at the moment is supported only through sysfs.
This might cause issues for applications wanting to list
configurations from a container where sysfs isn't available.
This change adds a way to query the number of configurations and their
content through the i915 query uAPI.
v2: Fix sparse warnings (Lionel)
Add support to query configuration using uuid (Lionel)
v3: Fix some inconsistency in uapi header (Lionel)
Fix unlocking when not locked issue (Lionel)
Add debug messages (Lionel)
v4: Fix missing unlock (Dan)
v5: Drop lock when copying config content to userspace (Chris)
v6: Drop lock when copying config list to userspace (Chris)
Fix deadlock when calling i915_perf_get_oa_config() under
perf.metrics_lock (Lionel)
Add i915_oa_config_get() (Chris)
Link: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/932
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191014201404.22468-2-chris@chris-wilson.co.uk
We haven't run into issues with programming the global OA/NOA
registers configuration from CPU so far, but HW engineers actually
recommend doing this from the command streamer. On TGL in particular
one of the clock domain in which some of that programming goes might
not be powered when we poke things from the CPU.
Since we have a command buffer prepared for the execbuffer side of
things, we can reuse that approach here too.
This also allows us to significantly reduce the amount of time we hold
the main lock.
v2: Drop the global lock as much as possible
v3: Take global lock to pin global
v4: Create i915 request in emit_oa_config() to avoid deadlocks (Lionel)
v5: Move locking to the stream (Lionel)
v6: Move active reconfiguration request into i915_perf_stream (Lionel)
v7: Pin VMA outside request creation (Chris)
Lock VMA before move to active (Chris)
v8: Fix double free on stream->initial_oa_config_bo (Lionel)
Don't allow interruption when waiting on active config request
(Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191012072308.30312-3-chris@chris-wilson.co.uk
NOA configuration take some amount of time to apply. That amount of
time depends on the size of the GT. There is no documented time for
this. For example, past experimentations with powergating
configuration changes seem to indicate a 60~70us delay. We go with
500us as default for now which should be over the required amount of
time (according to HW architects).
v2: Don't forget to save/restore registers used for the wait (Chris)
v3: Name used CS_GPR registers (Chris)
Fix compile issue due to rebase (Lionel)
v4: Fix save/restore helpers (Umesh)
v5: Move noa_wait from drm_i915_private to i915_perf_stream (Lionel)
v6: Add missing struct declarations in i915_perf.h
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191012072308.30312-2-chris@chris-wilson.co.uk
Here we introduce a mechanism by which the execbuf part of the i915
driver will be able to request that a batch buffer containing the
programming for a particular OA config be created.
We'll execute these OA configuration buffers right before executing a
set of userspace commands so that a particular user batchbuffer be
executed with a given OA configuration.
This mechanism essentially allows the userspace driver to go through
several OA configuration without having to open/close the i915/perf
stream.
v2: No need for locking on object OA config object creation (Chris)
Flush cpu mapping of OA config (Chris)
v3: Properly deal with the perf_metric lock (Chris/Lionel)
v4: Fix oa config unref/put when not found (Lionel)
v5: Allocate BOs for configurations on the stream instead of globally
(Lionel)
v6: Fix 64bit division (Chris)
v7: Store allocated config BOs into the stream (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191012072308.30312-1-chris@chris-wilson.co.uk
With the introduction of ctx->engines[] we allow multiple logical
contexts to be used on the same engine (e.g. with virtual engines).
According to bspec, aach logical context requires a unique tag in order
for context-switching to occur correctly between them. [Simple
experiments show that it is not so easy to trick the HW into performing
a lite-restore with matching logical IDs, though my memory from early
Broadwell experiments do suggest that it should be generating
lite-restores.]
We only need to keep a unique tag for the active lifetime of the
context, and for as long as we need to identify that context. The HW
uses the tag to determine if it should use a lite-restore (why not the
LRCA?) and passes the tag back for various status identifies. The only
status we need to track is for OA, so when using perf, we assign the
specific context a unique tag.
v2: Calculate required number of tags to fill ELSP.
Fixes: 976b55f0e1 ("drm/i915: Allow a context to define its set of engines")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111895
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-14-chris@chris-wilson.co.uk
Replace the struct_mutex requirement for pinning the i915_vma with the
local vm->mutex instead. Note that the vm->mutex is tainted by the
shrinker (we require unbinding from inside fs-reclaim) and so we cannot
allocate while holding that mutex. Instead we have to preallocate
workers to do allocate and apply the PTE updates after we have we
reserved their slot in the drm_mm (using fences to order the PTE writes
with the GPU work and with later unbind).
In adding the asynchronous vma binding, one subtle requirement is to
avoid coupling the binding fence into the backing object->resv. That is
the asynchronous binding only applies to the vma timeline itself and not
to the pages as that is a more global timeline (the binding of one vma
does not need to be ordered with another vma, nor does the implicit GEM
fencing depend on a vma, only on writes to the backing store). Keeping
the vma binding distinct from the backing store timelines is verified by
a number of async gem_exec_fence and gem_exec_schedule tests. The way we
do this is quite simple, we keep the fence for the vma binding separate
and only wait on it as required, and never add it to the obj->resv
itself.
Another consequence in reducing the locking around the vma is the
destruction of the vma is no longer globally serialised by struct_mutex.
A natural solution would be to add a kref to i915_vma, but that requires
decoupling the reference cycles, possibly by introducing a new
i915_mm_pages object that is own by both obj->mm and vma->pages.
However, we have not taken that route due to the overshadowing lmem/ttm
discussions, and instead play a series of complicated games with
trylocks to (hopefully) ensure that only one destruction path is called!
v2: Add some commentary, and some helpers to reduce patch churn.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-4-chris@chris-wilson.co.uk
Before we submit the first context to HW, we need to construct a valid
image of the register state. This layout is defined by the HW and should
match the layout generated by HW when it saves the context image.
Asserting that this should be equivalent should help avoid any undefined
behaviour and verify that we haven't missed anything important!
Of course, having insisted that the initial register state within the
LRC should match that returned by HW, we need to ensure that it does.
v2: Drop the RELATIVE_MMIO flag from gen11, we ignore it for
constructing the lrc image.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190924145950.3011-1-chris@chris-wilson.co.uk
It used to be handy that we only had a couple of headers, but over time
i915_drv.h has become unwieldy. Extract declarations to a separate
header file corresponding to the implementation module, clarifying the
modularity of the driver.
Ensure the new header is self-contained, and do so with minimal further
includes, using forward declarations as needed. Include the new header
only where needed, and sort the modified include directives while at it
and as needed.
No functional changes.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/d7826e365695f691a3ac69a69ff6f2bbdb62700d.1565271681.git.jani.nikula@intel.com
The oa object manages the oa buffer and must be allocated when the user
intends to read performance counter snapshots. This can be achieved by
making the oa object part of the stream object which is allocated when a
stream is opened by the user.
Attributes in the oa object that are gen-specific are moved to the perf
object so that they can be initialized on driver load.
The split provides a better separation of the objects used in perf
implementation of i915 driver so that resources are allocated and
initialized only when needed.
v2: Fix checkpatch warnings
v3: Addressed Lionel's review comment
v4: Rebase
v5: Fix rebase/merge issue with ratelimit_state_init
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190806233002.984-1-umesh.nerlige.ramappa@intel.com