Typically, there is space available within the ring and if not we have
to wait (by definition a slow path). Rearrange the code to reduce the
number of branches and stack size for the hotpath, accomodating a slight
growth for the wait.
v2: Fix the new assert that packets are not larger than the actual ring.
v3: Make the parameters unsigned as well to make usage.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170504130846.4807-3-chris@chris-wilson.co.uk
Since unifying ringbuffer/execlist submission to use
engine->pin_context, we ensure that the intel_ring is available before
we start constructing the request. We can therefore move the assignment
of the request->ring to the central i915_gem_request_alloc() and not
require it in every engine->request_alloc() callback. Another small step
towards simplification (of the core, but at a cost of handling error
pointers in less important callers of engine->pin_context).
v2: Rearrange a few branches to reduce impact of PTR_ERR() on gcc's code
generation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Oscar Mateo <oscar.mateo@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170504093308.4137-1-chris@chris-wilson.co.uk
The contents of a ring are only valid between HEAD and TAIL, when the
ring is idle (HEAD == TAIL) we can simply let the pages go under memory
pressure if they are not pinned by an active context. Any new content
will be written after HEAD and so the ring will again be valid between
HEAD and TAIL, everything outside can be discarded.
Note that we take care of ensuring that we do not reset the HEAD
backwards following a GPU hang on an idle ring.
The same precautions are what enable us to use stolen memory for rings.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20170420101709.27250-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Or rather it is used only by intel_ring_pin() to extract the
drm_i915_private which we can easily pass in. As this is a relatively
rare operation, save the space in the struct, and as such it is even
break even in the extra code for passing around the parameter:
add/remove: 0/0 grow/shrink: 2/3 up/down: 15/-15 (0)
function old new delta
intel_init_ring_buffer 906 918 +12
execlists_context_pin 1308 1311 +3
mock_engine 407 403 -4
intel_engine_create_ring 367 363 -4
intel_ring_pin 326 319 -7
Total: Before=1261794, After=1261794, chg +0.00%
v2: Reorder intel_init_ring_buffer to keep the ring setup together:
add/remove: 0/0 grow/shrink: 2/3 up/down: 9/-15 (-6)
function old new delta
intel_init_ring_buffer 906 912 +6
execlists_context_pin 1308 1311 +3
mock_engine 407 403 -4
intel_engine_create_ring 367 363 -4
intel_ring_pin 326 319 -7
Total: Before=1261794, After=1261788, chg -0.00%
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170403113426.25707-1-chris@chris-wilson.co.uk
This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.
intel_ring_emit was preventing the compiler for optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.
It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.
Useless instruction removal amounts to approximately
two and half kilobytes of saved text on my build.
Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.
v2:
* Change return from intel_ring_begin to error pointer by
popular demand.
* Move tail increment to intel_ring_advance to enable some
error checking.
v3:
* Move tail advance back into intel_ring_begin.
* Rebase and tidy.
v4:
* Complete rebase after a few months since v3.
v5:
* Remove unecessary cast and fix !debug compile. (Chris Wilson)
v6:
* Make intel_ring_offset take request as well.
* Fix recording of request postfix plus a sprinkle of asserts.
(Chris Wilson)
v7:
* Use intel_ring_offset to get the postfix. (Chris Wilson)
* Convert GVT code as well.
v8:
* Rename *out++ to *cs++.
v9:
* Fix GVT out to cs conversion in GVT.
v10:
* Rebase for new intel_ring_begin in selftests.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170214113242.29241-1-tvrtko.ursulin@linux.intel.com
Following a reset, the context and page directory registers are lost.
However, the queue of requests that we resubmit after the reset may
depend upon them - the registers are restored from a context image, but
that restore may be inhibited and may simply be absent from the request
if it was in the middle of a sequence using the same context. If we
prime the CCID/PD registers with the first request in the queue (even
for the hung request), we prevent invalid memory access for the
following requests (and continually hung engines).
v2: Magic BIT(8), reserved for future use but still appears unused.
v3: Some commentary on handling innocent vs guilty requests
v4: Add a wait for PD_BASE fetch. The reload appears to be instant on my
Ivybridge, but this bit probably exists for a reason.
Fixes: 821ed7df6e ("drm/i915: Update reset path to fix incomplete requests")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170207152437.4252-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
It is required that the caller declare the exact number of dwords they
wish to write into the ring. This is required for two reasons, we need
to allocate sufficient space for the entire command packet and we need
to be sure that the contents are completely written to avoid executing
stale data. The current interface requires for any bug to be caught in
review, the reader has to carefully count the number of
intel_ring_emit() between intel_ring_begin() and intel_ring_advance().
If we record the end of the packet of each intel_ring_begin() we can
also have CI check for us.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170206170502.30944-1-chris@chris-wilson.co.uk
With the introduce of i915_vma_instance() for obtaining the VMA
singleton for a (obj, vm, view) tuple, we can remove the
i915_vma_create() in favour of a single entry point. We do incur a
lookup onto an empty tree, but the i915_vma_create() were being called
infrequently and during initialisation, so the small overhead is
negligible.
v2: Drop the i915_ prefix from the now static vma_create() function
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170116152131.18089-4-chris@chris-wilson.co.uk
The WaDisableLSQCROPERFforOCL workaround has the side effect of
disabling an L3SQ optimization that has huge performance implications
and is unlikely to be necessary for the correct functioning of usual
graphic workloads. Userspace is free to re-enable the workaround on
demand, and is generally in a better position to determine whether the
workaround is necessary than the DRM is (e.g. only during the
execution of compute kernels that rely on both L3 fences and HDC R/W
requests).
The same workaround seems to apply to BDW (at least to production
stepping G1) and SKL as well (the internal workaround database claims
that it does for all steppings, while the BSpec workaround table only
mentions pre-production steppings), but the DRM doesn't do anything
beyond whitelisting the L3SQCREG4 register so userspace can enable it
when it sees fit. Do the same on KBL platforms.
Improves performance of the GFXBench4 gl_manhattan31 benchmark by 60%,
and gl_4 (AKA car chase) by 14% on a KBL GT2 running Mesa master --
This is followed by a regression of 35% and 10% respectively for the
same benchmarks and platform caused by my recent patch series
switching userspace to use the dataport constant cache instead of the
sampler to implement uniform pull constant loads, which caused us to
hit more heavily the L3 cache (and on platforms other than KBL had the
opposite effect of improving performance of the same two benchmarks).
The overall effect on KBL of this change combined with the recent
userspace change is respectively 4.6% and 2.6%. SynMark2 OglShMapPcf
was affected by the constant cache changes (though it improved as it
did on other platforms rather than regressing), but is not
significantly affected by this patch (with statistical significance of
5% and sample size 20).
v2: Drop some more code to avoid unused variable warning.
Fixes: 738fa1b312 ("drm/i915/kbl: Add WaDisableLSQCROPERFforOCL")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99256
Signed-off-by: Francisco Jerez <currojerez@riseup.net>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: beignet@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v4.7+
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
[Removed double Fixes tag]
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1484217894-20505-1-git-send-email-mika.kuoppala@intel.com
Start converting over from the byte count to its semantic macro, either
we want to allocate the size of a physical page in main memory or we
want the size of a virtual page in the GTT. 4096 could mean either, but
PAGE_SIZE and I915_GTT_PAGE_SIZE are explicit and should help improve
code comprehension and future changes. In the future, we may want to use
variable GTT page sizes and so have the challenge of knowing which
hardcoded values were used to represent a physical page vs the virtual
page.
v2: Look for a few more 4096s to convert, discover IS_ALIGNED().
v3: 4096ul paranoia, make fence alignment a distinct value of 4096, keep
bdw stolen w/a as 4096 until we know better.
v4: Add asserts that i915_vma_insert() start/end are aligned to GTT page
sizes.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20170110144734.26052-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
The kernel context (dev_priv->kernel_context) is unique in that it is
not associated with any user filp - it is the only one with
ctx->file_priv == NULL. This is a simpler test than comparing it against
dev_priv->kernel_context which involves some pointer dancing.
In checking that this is true, we notice that the gvt context is
allocating itself a i915_hw_ppgtt it doesn't use and not flagging that
its file_priv should be invalid.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170106152013.24684-5-chris@chris-wilson.co.uk
The requests conversion introduced a nasty bug where we could generate a
new request in the middle of constructing a request if we needed to idle
the system in order to evict space for a context. The request to idle
would be executed (and waited upon) before the current one, creating a
minor havoc in the seqno accounting, as we will consider the current
request to already be completed (prior to deferred seqno assignment) but
ring->last_retired_head would have been updated and still could allow
us to overwrite the current request before execution.
We also employed two different mechanisms to track the active context
until it was switched out. The legacy method allowed for waiting upon an
active context (it could forcibly evict any vma, including context's),
but the execlists method took a step backwards by pinning the vma for
the entire active lifespan of the context (the only way to evict was to
idle the entire GPU, not individual contexts). However, to circumvent
the tricky issue of locking (i.e. we cannot take struct_mutex at the
time of i915_gem_request_submit(), where we would want to move the
previous context onto the active tracker and unpin it), we take the
execlists approach and keep the contexts pinned until retirement.
The benefit of the execlists approach, more important for execlists than
legacy, was the reduction in work in pinning the context for each
request - as the context was kept pinned until idle, it could short
circuit the pinning for all active contexts.
We introduce new engine vfuncs to pin and unpin the context
respectively. The context is pinned at the start of the request, and
only unpinned when the following request is retired (this ensures that
the context is idle and coherent in main memory before we unpin it). We
move the engine->last_context tracking into the retirement itself
(rather than during request submission) in order to allow the submission
to be reordered or unwound without undue difficultly.
And finally an ulterior motive for unifying context handling was to
prepare for mock requests.
v2: Rename to last_retired_context, split out legacy_context tracking
for MI_SET_CONTEXT.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161218153724.8439-3-chris@chris-wilson.co.uk
Where it is more appropriate and also to be consistent with
the direction of the driver.
v2: Leave out object alloc/free inlining. (Joonas Lahtinen)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Defer the transfer from the client's timeline onto the execution
timeline from the point of readiness to the point of actual submission.
For example, in execlists, a request is finally submitted to hardware
when the hardware is ready, and only put onto the hardware queue when
the request is ready. By deferring the transfer, we ensure that the
timeline is maintained in retirement order if we decide to queue the
requests onto the hardware in a different order than fifo.
v2: Rebased onto distinct global/user timeline lock classes.
v3: Play with the position of the spin_lock().
v4: Nesting finally resolved with distinct sw_fence lock classes.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161114204105.29171-4-chris@chris-wilson.co.uk
For legacy contexts we employ an optimisation to only flush the context
when binding into the global GTT. This avoids stalling on the GPU when
reloading an active context. Wrap this detail up into a helper and
export it for a potential third user. (Longer term, context pinning
needs to be reworked as the current handling of switch context pins too
late and so risks eviction and corrupting the request. Plans, plans,
plans.)
v2: Expand the comment explaining the optimisation for avoiding the
stall on active contexts.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20161030132820.32163-1-chris@chris-wilson.co.uk
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Move the actual emission of the breadcrumb for closing the request from
i915_add_request() to the submit callback. (It can be moved later when
required.) This allows us to defer the allocation of the global_seqno
from request construction to actual submission, allowing us to emit the
requests out of order (wrt to the order of their construction, they
still will only be executed one all of their dependencies are resolved
including that all earlier requests on their timeline have been
submitted.) We have to specialise how we then emit the request in order
to write into the preallocated space, rather than at the tail of the
ringbuffer (which will have been advanced by the addition of new
requests).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-29-chris@chris-wilson.co.uk