All the requests at the same priority are executed in FIFO order. They
do not need to be stored in the rbtree themselves, as they are a simple
list within a level. If we move the requests at one priority into a list,
we can then reduce the rbtree to the set of priorities. This should keep
the height of the rbtree small, as the number of active priorities can not
exceed the number of active requests and should be typically only a few.
Currently, we have ~2k possible different priority levels, that may
increase to allow even more fine grained selection. Allocating those in
advance seems a waste (and may be impossible), so we opt for allocating
upon first use, and freeing after its requests are depleted. To avoid
the possibility of an allocation failure causing us to lose a request,
we preallocate the default priority (0) and bump any request to that
priority if we fail to allocate it the appropriate plist. Having a
request (that is ready to run, so not leading to corruption) execute
out-of-order is better than leaking the request (and its dependency
tree) entirely.
There should be a benefit to reducing execlists_dequeue() to principally
using a simple list (and reducing the frequency of both rbtree iteration
and balancing on erase) but for typical workloads, request coalescing
should be small enough that we don't notice any change. The main gain is
from improving PI calls to schedule, and the explicit list within a
level should make request unwinding simpler (we just need to insert at
the head of the list rather than the tail and not have to make the
rbtree search more complicated).
v2: Avoid use-after-free when deleting a depleted priolist
v3: Michał found the solution to handling the allocation failure
gracefully. If we disable all priority scheduling following the
allocation failure, those requests will be executed in fifo and we will
ensure that this request and its dependencies are in strict fifo (even
when it doesn't realise it is only a single list). Normal scheduling is
restored once we know the device is idle, until the next failure!
Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
function old new delta
execlists_submit_ports 262 471 +209
port_assign.isra - 136 +136
capture 6344 6359 +15
reset_common_ring 438 452 +14
execlists_submit_request 228 238 +10
gen8_init_common_ring 334 341 +7
intel_engine_is_idle 106 105 -1
i915_engine_info 2314 2290 -24
__i915_gem_set_wedged_BKL 485 411 -74
intel_lrc_irq_handler 1789 1604 -185
execlists_update_context 294 - -294
The most important change there is the improve to the
intel_lrc_irq_handler and excclist_submit_ports (net improvement since
execlists_update_context is now inlined).
v2: Use the port_api() for guc as well (even though currently we do not
pack any counters in there, yet) and hide all port->request_count inside
the helpers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-5-chris@chris-wilson.co.uk
Typically, there is space available within the ring and if not we have
to wait (by definition a slow path). Rearrange the code to reduce the
number of branches and stack size for the hotpath, accomodating a slight
growth for the wait.
v2: Fix the new assert that packets are not larger than the actual ring.
v3: Make the parameters unsigned as well to make usage.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170504130846.4807-3-chris@chris-wilson.co.uk
Since unifying ringbuffer/execlist submission to use
engine->pin_context, we ensure that the intel_ring is available before
we start constructing the request. We can therefore move the assignment
of the request->ring to the central i915_gem_request_alloc() and not
require it in every engine->request_alloc() callback. Another small step
towards simplification (of the core, but at a cost of handling error
pointers in less important callers of engine->pin_context).
v2: Rearrange a few branches to reduce impact of PTR_ERR() on gcc's code
generation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Oscar Mateo <oscar.mateo@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170504093308.4137-1-chris@chris-wilson.co.uk
Or rather it is used only by intel_ring_pin() to extract the
drm_i915_private which we can easily pass in. As this is a relatively
rare operation, save the space in the struct, and as such it is even
break even in the extra code for passing around the parameter:
add/remove: 0/0 grow/shrink: 2/3 up/down: 15/-15 (0)
function old new delta
intel_init_ring_buffer 906 918 +12
execlists_context_pin 1308 1311 +3
mock_engine 407 403 -4
intel_engine_create_ring 367 363 -4
intel_ring_pin 326 319 -7
Total: Before=1261794, After=1261794, chg +0.00%
v2: Reorder intel_init_ring_buffer to keep the ring setup together:
add/remove: 0/0 grow/shrink: 2/3 up/down: 9/-15 (-6)
function old new delta
intel_init_ring_buffer 906 912 +6
execlists_context_pin 1308 1311 +3
mock_engine 407 403 -4
intel_engine_create_ring 367 363 -4
intel_ring_pin 326 319 -7
Total: Before=1261794, After=1261788, chg -0.00%
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170403113426.25707-1-chris@chris-wilson.co.uk
If the request->wa_tail is 0 (because it landed exactly on the end of
the ringbuffer), when we reconstruct request->tail following a reset we
fill in an illegal value (-8 or 0x001ffff8). As a result, RING_HEAD is
never able to catch up with RING_TAIL and the GPU spins endlessly. If
the ring contains a couple of breadcrumbs, even our hangcheck is unable
to catch the busy-looping as the ACTHD and seqno continually advance.
v2: Move the wrap into a common intel_ring_wrap().
Fixes: a3aabe86a3 ("drm/i915/execlists: Reinitialise context image after GPU hang")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: <stable@vger.kernel.org> # v4.10+
Link: http://patchwork.freedesktop.org/patch/msgid/20170327130009.4678-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
GVTg has introduced the context status notifier to schedule the GVTg
workload. At that time, the notifier is bound to GVTg context only,
so GVTg is not aware of host workloads.
Now we are going to improve GVTg's guest workload scheduler policy,
and add Guc emulation support for new Gen graphics. Both these two
features require acknowledgment for all contexts running on hardware.
(But will not alter host workload.) So here try to make some change.
The change is simple:
1. Move the context status notifier head from i915_gem_context to
intel_engine_cs. Which means there is a notifier head per engine
instead of per context. Execlist driver still call notifier for
each context sched-in/out events of current engine.
2. At GVTg side, it binds a notifier_block for each physical engine
at GVTg initialization period. Then GVTg can hear all context
status events.
In this patch, GVTg do nothing for host context event, but later
will add a function there. But in any case, the notifier callback is
a noop if this is no active vGPU.
Since intel_gvt_init() is called at early initialization stage and
require the status notifier head has been initiated, I initiate it in
intel_engine_setup().
v2: remove a redundant newline. (chris)
Fixes: 3c7ba6359d ("drm/i915: Introduce execlist context status change notification")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100232
Signed-off-by: Changbin Du <changbin.du@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20170313024711.28591-1-changbin.du@intel.com
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
During reset_all_global_seqno() on seqno rollover, we have to update the
HWS. This causes all in flight requests to be completed, so first we
wait. However, we were only waiting for the requests themselves to be
completed and clearing out the waiter rbtrees - what I had missed was
the extra reference in execlists->port[]. Since commit fe9ae7a3bf
("drm/i915/execlists: Detect an out-of-order context switch") we can
detect when the request is retired before the context switch interrupt
is completed. The impact should be neglible outside of debugging.
Testcase: igt/gem_exec_whisper
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170303121947.20482-1-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
A significant cost in setting up a wait is the overhead of enabling the
interrupt. As we disable the interrupt whenever the queue of waiters is
empty, if we are frequently waiting on alternating batches, we end up
re-enabling the interrupt on a frequent basis. We do want to disable the
interrupt during normal operations as under high load it may add several
thousand interrupts/s - we have been known in the past to occupy whole
cores with our interrupt handler after accidentally leaving user
interrupts enabled. As a compromise, leave the interrupt enabled until
the next IRQ, or the system is idle. This gives a small window for a
waiter to keep the interrupt active and not be delayed by having to
re-enable the interrupt.
v2: Restore hangcheck/missed-irq detection for continuations
v3: Be more careful restoring the hangcheck timer after reset
v4: Be more careful restoring the fake irq after reset (if required!)
v5: Redo changes to intel_engine_wakeup()
v6: Factor out __intel_engine_wakeup()
v7: Improve commentary for declaring a missed wakeup
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170227205850.2828-4-chris@chris-wilson.co.uk
As execlists and other non-semaphore multi-engine devices coordinate
between engines using interrupts, we can shave off a few 10s of
microsecond of scheduling latency by doing the fence signaling from the
interrupt as opposed to a RT kthread. (Realistically the delay adds
about 1% to an individual cross-engine workload.) We only signal the
first fence in order to limit the amount of work we move into the
interrupt handler. We also have to remember that our breadcrumbs may be
unordered with respect to the interrupt and so we still require the
waiter process to perform some heavyweight coherency fixups, as well as
traversing the tree of waiters.
v2: No need for early exit in irq handler - it breaks the flow between
patches and prevents the tracepoint
v3: Restore rcu hold across irq signaling of request
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170227205850.2828-2-chris@chris-wilson.co.uk
The two users of the return value from intel_engine_wakeup() are
expecting different results. In the breadcrumbs hangcheck, we are using
it to determine whether wake_up_process() detected the waiter was
currently running (and if so we presume that it hasn't yet missed the
interrupt). However, in the fake_irq path, we are using the return value
as a check as to whether there are any waiters, and so we may
incorrectly stop the fake-irq if that waiter was currently running.
To handle the two different needs, return both bits of information! We
uninline it from the irq path in preparation for the next patch which
makes the irq hotpath special and relegates intel_engine_wakeup() to the
slow fixup paths.
v2: s/ret/result/
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170227205850.2828-1-chris@chris-wilson.co.uk
A request is assigned a global seqno only when it is on the hardware
execution queue. The global seqno can be used to maintain a list of
requests on the same engine in retirement order, for example for
constructing a priority queue for waiting. Prior to its execution, or
if it is subsequently removed in the event of preemption, its global
seqno is zero. As both insertion and removal from the execution queue
may operate in IRQ context, it is not guarded by the usual struct_mutex
BKL. Instead those relying on the global seqno must be prepared for its
value to change between reads. Only when the request is complete can
the global seqno be stable (due to the memory barriers on submitting
the commands to the hardware to write the breadcrumb, if the HWS shows
that it has passed the global seqno and the global seqno is unchanged
after the read, it is indeed complete).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170223074422.4125-9-chris@chris-wilson.co.uk
Replace the global device seqno with one for each engine, and account
for in-flight seqno on each separately. This is consistent with
dma-fence as each timeline has separate fence-contexts for each engine
and a seqno is only ordered within a fence-context (i.e. seqno do not
need to be ordered wrt to other engines, just ordered within a single
engine). This is required to enable request rewinding for preemption on
individual engines (we have to rewind the global seqno to avoid
overflow, and we do not have to rewind all engines just to preempt one.)
v2: Rename active_seqno to inflight_seqnos to more clearly indicate that
it is a counter and not equivalent to the existing seqno. Update
functions that operated on active_seqno similarly.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170223074422.4125-3-chris@chris-wilson.co.uk
When the timer expires for checking on interrupt processing, check to
see if any interrupts arrived within the last time period. If real
interrupts are still being delivered, we can be reassured that we
haven't missed the final interrupt as the waiter will still be woken.
Only once all activity ceases, do we have to worry about the waiter
never being woken and so need to install a timer to kick the waiter for
a slow arrival of a seqno.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170217151304.16665-2-chris@chris-wilson.co.uk
This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.
intel_ring_emit was preventing the compiler for optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.
It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.
Useless instruction removal amounts to approximately
two and half kilobytes of saved text on my build.
Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.
v2:
* Change return from intel_ring_begin to error pointer by
popular demand.
* Move tail increment to intel_ring_advance to enable some
error checking.
v3:
* Move tail advance back into intel_ring_begin.
* Rebase and tidy.
v4:
* Complete rebase after a few months since v3.
v5:
* Remove unecessary cast and fix !debug compile. (Chris Wilson)
v6:
* Make intel_ring_offset take request as well.
* Fix recording of request postfix plus a sprinkle of asserts.
(Chris Wilson)
v7:
* Use intel_ring_offset to get the postfix. (Chris Wilson)
* Convert GVT code as well.
v8:
* Rename *out++ to *cs++.
v9:
* Fix GVT out to cs conversion in GVT.
v10:
* Rebase for new intel_ring_begin in selftests.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170214113242.29241-1-tvrtko.ursulin@linux.intel.com
In commit 86aa7e760a ("drm/i915: Assert that the context-switch
completion matches our context") I added a read to the irq tasklet
handler that compared the on-chip status with that of our sw tracking,
using an unguarded read of the request pointer to get the context and
beyond. Whilst we hold a reference to the request, we do not hold
anything on the context and if we are unlucky it may be reaped from a
second thread retiring the request (since it may retire the request as
soon as the breadcrumb is complete, even before we finish processing the
context switch) as we try to read from the context pointer.
Avoid the racy read from underneath the request by storing the expected
result in the execlist_port[].
v2: Include commentary about port[].request being unprotected.
Fixes: 86aa7e760a ("drm/i915: Assert that the context-switch completion matches our context")
Reported-by: Mika Kuoppala <mika.kuoppala@intel.com>
Testcase: igt/gem_ctx_create
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170206170502.30944-2-chris@chris-wilson.co.uk
It is required that the caller declare the exact number of dwords they
wish to write into the ring. This is required for two reasons, we need
to allocate sufficient space for the entire command packet and we need
to be sure that the contents are completely written to avoid executing
stale data. The current interface requires for any bug to be caught in
review, the reader has to carefully count the number of
intel_ring_emit() between intel_ring_begin() and intel_ring_advance().
If we record the end of the packet of each intel_ring_begin() we can
also have CI check for us.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170206170502.30944-1-chris@chris-wilson.co.uk
We need to prevent resubmission of the context immediately following an
initial resubmit (which does a lite-restore preemption). Currently we do
this by disabling all submission whilst the context is still active, but
we can improve this by limiting the restriction to only until we
receive notification from the context-switch interrupt that the
lite-restore preemption is complete.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170124110009.28947-2-chris@chris-wilson.co.uk