Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
     to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
     Valentin Schneider)

   - Topology handling improvements, in particular when CPU capacity
     changes and related load-balancing fixes/improvements (Morten
     Rasmussen)

   - ... plus misc other improvements, fixes and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
  sched/completions/Documentation: Clean up the document some more
  sched/completions/Documentation: Fix a couple of punctuation nits
  cpu/SMT: State SMT is disabled even with nosmt and without "=force"
  sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
  sched/fair: Remove setting task's se->runnable_weight during PELT update
  sched/fair: Disable LB_BIAS by default
  sched/pelt: Fix warning and clean up IRQ PELT config
  sched/topology: Make local variables static
  sched/debug: Use symbolic names for task state constants
  sched/numa: Remove unused numa_stats::nr_running field
  sched/numa: Remove unused code from update_numa_stats()
  sched/debug: Explicitly cast sched_feat() to bool
  sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
  sched/fair: Don't move tasks to lower capacity CPUs unless necessary
  sched/fair: Set rq->rd->overload when misfit
  sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
  sched/core: Change root_domain->overload type to int
  sched/fair: Change 'prefer_sibling' type to bool
  sched/fair: Kick nohz balance if rq->misfit_task_load
  ...
@@ -1,146 +1,187 @@
|
||||
completions - wait for completion handling
|
||||
==========================================
|
||||
|
||||
This document was originally written based on 3.18.0 (linux-next)
|
||||
Completions - "wait for completion" barrier APIs
|
||||
================================================
|
||||
|
||||
Introduction:
|
||||
-------------
|
||||
|
||||
If you have one or more threads of execution that must wait for some process
|
||||
If you have one or more threads that must wait for some kernel activity
|
||||
to have reached a point or a specific state, completions can provide a
|
||||
race-free solution to this problem. Semantically they are somewhat like a
|
||||
pthread_barrier and have similar use-cases.
|
||||
pthread_barrier() and have similar use-cases.
|
||||
|
||||
Completions are a code synchronization mechanism which is preferable to any
|
||||
misuse of locks. Any time you think of using yield() or some quirky
|
||||
msleep(1) loop to allow something else to proceed, you probably want to
|
||||
look into using one of the wait_for_completion*() calls instead. The
|
||||
advantage of using completions is clear intent of the code, but also more
|
||||
efficient code as both threads can continue until the result is actually
|
||||
needed.
|
||||
misuse of locks/semaphores and busy-loops. Any time you think of using
|
||||
yield() or some quirky msleep(1) loop to allow something else to proceed,
|
||||
you probably want to look into using one of the wait_for_completion*()
|
||||
calls and complete() instead.
|
||||
|
||||
Completions are built on top of the generic event infrastructure in Linux,
|
||||
with the event reduced to a simple flag (appropriately called "done") in
|
||||
struct completion that tells the waiting threads of execution if they
|
||||
can continue safely.
|
||||
The advantage of using completions is that they have a well defined, focused
|
||||
purpose which makes it very easy to see the intent of the code, but they
|
||||
also result in more efficient code as all threads can continue execution
|
||||
until the result is actually needed, and both the waiting and the signalling
|
||||
is highly efficient using low level scheduler sleep/wakeup facilities.
|
||||
|
||||
As completions are scheduling related, the code is found in
|
||||
Completions are built on top of the waitqueue and wakeup infrastructure of
|
||||
the Linux scheduler. The event the threads on the waitqueue are waiting for
|
||||
is reduced to a simple flag in 'struct completion', appropriately called "done".
|
||||
|
||||
As completions are scheduling related, the code can be found in
|
||||
kernel/sched/completion.c.
|
||||
|
||||
|
||||
Usage:
|
||||
------
|
||||
|
||||
There are three parts to using completions, the initialization of the
|
||||
struct completion, the waiting part through a call to one of the variants of
|
||||
wait_for_completion() and the signaling side through a call to complete()
|
||||
or complete_all(). Further there are some helper functions for checking the
|
||||
state of completions.
|
||||
There are three main parts to using completions:
|
||||
|
||||
To use completions one needs to include <linux/completion.h> and
|
||||
create a variable of type struct completion. The structure used for
|
||||
handling of completions is:
|
||||
- the initialization of the 'struct completion' synchronization object,
|
||||
- the waiting part through a call to one of the variants of wait_for_completion(),
|
||||
- the signaling side through a call to complete() or complete_all().
|
||||
|
||||
There are also some helper functions for checking the state of completions.
|
||||
Note that while initialization must happen first, the waiting and signaling
|
||||
part can happen in any order. I.e. it's entirely normal for a thread
|
||||
to have marked a completion as 'done' before another thread checks whether
|
||||
it has to wait for it.
|
||||
|
||||
To use completions you need to #include <linux/completion.h> and
|
||||
create a static or dynamic variable of type 'struct completion',
|
||||
which has only two fields:
|
||||
|
||||
	struct completion {
		unsigned int done;
		wait_queue_head_t wait;
	};
|
||||
|
||||
providing the wait queue to place tasks on for waiting and the flag for
|
||||
indicating the state of affairs.
|
||||
This provides the ->wait waitqueue to place tasks on for waiting (if any), and
|
||||
the ->done completion flag for indicating whether it's completed or not.
|
||||
|
||||
Completions should be named to convey the intent of the waiter. A good
|
||||
example is:
|
||||
Completions should be named to refer to the event that is being synchronized on.
|
||||
A good example is:
|
||||
|
||||
wait_for_completion(&early_console_added);
|
||||
|
||||
complete(&early_console_added);
|
||||
|
||||
Good naming (as always) helps code readability.
|
||||
Good, intuitive naming (as always) helps code readability. Naming a completion
|
||||
'complete' is not helpful unless the purpose is super obvious...
|
||||
|
||||
|
||||
Initializing completions:
|
||||
-------------------------
|
||||
|
||||
Initialization of dynamically allocated completions, often embedded in
|
||||
other structures, is done with:
|
||||
Dynamically allocated completion objects should preferably be embedded in data
|
||||
structures that are assured to be alive for the life-time of the function/driver,
|
||||
to prevent races with asynchronous complete() calls from occurring.
|
||||
|
||||
void init_completion(&done);
|
||||
Particular care should be taken when using the _timeout() or _killable()/_interruptible()
|
||||
variants of wait_for_completion(), as it must be assured that memory de-allocation
|
||||
does not happen until all related activities (complete() or reinit_completion())
|
||||
have taken place, even if these wait functions return prematurely due to a timeout
|
||||
or a signal triggering.
|
||||
|
||||
Initialization is accomplished by initializing the wait queue and setting
|
||||
the default state to "not available", that is, "done" is set to 0.
|
||||
Initialization of dynamically allocated completion objects is done via a call to
|
||||
init_completion():
|
||||
|
||||
init_completion(&dynamic_object->done);
|
||||
|
||||
In this call we initialize the waitqueue and set ->done to 0, i.e. "not completed"
|
||||
or "not done".
|
||||
|
||||
The re-initialization function, reinit_completion(), simply resets the
|
||||
done element to "not available", thus again to 0, without touching the
|
||||
wait queue. Calling init_completion() twice on the same completion object is
|
||||
->done field to 0 ("not done"), without touching the waitqueue.
|
||||
Callers of this function must make sure that there are no racy
|
||||
wait_for_completion() calls going on in parallel.
|
||||
|
||||
Calling init_completion() on the same completion object twice is
|
||||
most likely a bug as it re-initializes the queue to an empty queue and
|
||||
enqueued tasks could get "lost" - use reinit_completion() in that case.
|
||||
enqueued tasks could get "lost" - use reinit_completion() in that case,
|
||||
but be aware of other races.
|
||||
|
||||
For static declaration and initialization, macros are available. These are:
|
||||
For static declaration and initialization, macros are available.
|
||||
|
||||
static DECLARE_COMPLETION(setup_done)
|
||||
For static (or global) declarations in file scope you can use DECLARE_COMPLETION():
|
||||
|
||||
used for static declarations in file scope. Within functions the static
|
||||
initialization should always use:
|
||||
static DECLARE_COMPLETION(setup_done);
|
||||
DECLARE_COMPLETION(setup_done);
|
||||
|
||||
Note that in this case the completion is boot time (or module load time)
|
||||
initialized to 'not done' and doesn't require an init_completion() call.
|
||||
|
||||
When a completion is declared as a local variable within a function,
|
||||
then the initialization should always use DECLARE_COMPLETION_ONSTACK()
|
||||
explicitly, not just to make lockdep happy, but also to make it clear
|
||||
that limited scope has been considered and is intentional:
|
||||
|
||||
DECLARE_COMPLETION_ONSTACK(setup_done)
|
||||
|
||||
suitable for automatic/local variables on the stack and will make lockdep
|
||||
happy. Note also that one needs to make *sure* the completion passed to
|
||||
work threads remains in-scope, and no references remain to on-stack data
|
||||
when the initiating function returns.
|
||||
Note that when using completion objects as local variables you must be
|
||||
acutely aware of the short life time of the function stack: the function
|
||||
must not return to a calling context until all activities (such as waiting
|
||||
threads) have ceased and the completion object is completely unused.
|
||||
|
||||
Using on-stack completions for code that calls any of the _timeout or
|
||||
_interruptible/_killable variants is not advisable as they will require
|
||||
additional synchronization to prevent the on-stack completion object in
|
||||
the timeout/signal cases from going out of scope. Consider using dynamically
|
||||
allocated completions when intending to use the _interruptible/_killable
|
||||
or _timeout variants of wait_for_completion().
|
||||
To emphasise this again: in particular when using some of the waiting API variants
|
||||
with more complex outcomes, such as the timeout or signalling (_timeout(),
|
||||
_killable() and _interruptible()) variants, the wait might complete
|
||||
prematurely while the object might still be in use by another thread - and a return
|
||||
from the wait_for_completion*() caller function will deallocate the function
|
||||
stack and cause subtle data corruption if a complete() is done in some
|
||||
other thread. Simple testing might not trigger these kinds of races.
|
||||
|
||||
If unsure, use dynamically allocated completion objects, preferably embedded
|
||||
in some other long lived object that has a boringly long life time which
|
||||
exceeds the life time of any helper threads using the completion object,
|
||||
or has a lock or other synchronization mechanism to make sure complete()
|
||||
is not called on a freed object.
|
||||
|
||||
A naive DECLARE_COMPLETION() on the stack triggers a lockdep warning.
|
||||
|
||||
Waiting for completions:
|
||||
------------------------
|
||||
|
||||
For a thread of execution to wait for some concurrent work to finish, it
|
||||
calls wait_for_completion() on the initialized completion structure.
|
||||
For a thread to wait for some concurrent activity to finish, it
|
||||
calls wait_for_completion() on the initialized completion structure:
|
||||
|
||||
void wait_for_completion(struct completion *done)
|
||||
|
||||
A typical usage scenario is:
|
||||
|
||||
CPU#1 CPU#2
|
||||
|
||||
struct completion setup_done;
|
||||
|
||||
init_completion(&setup_done);
|
||||
initialize_work(...,&setup_done,...)
|
||||
initialize_work(...,&setup_done,...);
|
||||
|
||||
/* run non-dependent code */ /* do setup */
|
||||
/* run non-dependent code */ /* do setup */
|
||||
|
||||
wait_for_completion(&setup_done); complete(setup_done)
|
||||
wait_for_completion(&setup_done); complete(setup_done);
|
||||
|
||||
This is not implying any temporal order on wait_for_completion() and the
|
||||
call to complete() - if the call to complete() happened before the call
|
||||
This is not implying any particular order between wait_for_completion() and
|
||||
the call to complete() - if the call to complete() happened before the call
|
||||
to wait_for_completion() then the waiting side simply will continue
|
||||
immediately as all dependencies are satisfied if not it will block until
|
||||
immediately as all dependencies are satisfied; if not, it will block until
|
||||
completion is signaled by complete().
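
The ordering rule above can be tried out in userspace. The following is a minimal sketch, assuming a pthread-based stand-in for 'struct completion'; the model_*() names, 'struct setup_ctx' and run_setup_demo() are hypothetical and are not kernel APIs. The kernel implementation uses a waitqueue and scheduler wakeups rather than a mutex and a condition variable.

```c
#include <pthread.h>

/* Userspace stand-in for 'struct completion' (hypothetical model). */
struct model_completion {
	unsigned int done;	/* posted but not yet consumed completions */
	pthread_mutex_t lock;	/* plays the role of the waitqueue lock */
	pthread_cond_t wait;	/* plays the role of the waitqueue */
};

static void model_init_completion(struct model_completion *x)
{
	x->done = 0;		/* "not done" */
	pthread_mutex_init(&x->lock, NULL);
	pthread_cond_init(&x->wait, NULL);
}

static void model_complete(struct model_completion *x)
{
	pthread_mutex_lock(&x->lock);
	x->done++;		/* post one completion */
	pthread_cond_signal(&x->wait);
	pthread_mutex_unlock(&x->lock);
}

static void model_wait_for_completion(struct model_completion *x)
{
	pthread_mutex_lock(&x->lock);
	while (!x->done)	/* returns at once if already completed */
		pthread_cond_wait(&x->wait, &x->lock);
	x->done--;		/* consume one completion */
	pthread_mutex_unlock(&x->lock);
}

/* The completion is embedded in the object shared with the worker. */
struct setup_ctx {
	struct model_completion setup_done;
	int result;
};

static void *do_setup(void *arg)
{
	struct setup_ctx *ctx = arg;

	ctx->result = 42;			/* do setup */
	model_complete(&ctx->setup_done);	/* signal the waiter */
	return NULL;
}

/* Returns the value produced by the worker, or -1 if it could not start. */
static int run_setup_demo(void)
{
	struct setup_ctx ctx = { .result = 0 };
	pthread_t worker;

	model_init_completion(&ctx.setup_done);
	if (pthread_create(&worker, NULL, do_setup, &ctx))
		return -1;

	/* non-dependent code could run here */

	model_wait_for_completion(&ctx.setup_done);
	pthread_join(worker, NULL);	/* ctx must outlive the worker */
	return ctx.result;
}
```

Because the counter is checked under the lock, it does not matter whether do_setup() runs before or after the waiter blocks. run_setup_demo() also joins the worker before returning, so the on-stack object outlives every user of it.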
|
||||
|
||||
Note that wait_for_completion() is calling spin_lock_irq()/spin_unlock_irq(),
|
||||
so it can only be called safely when you know that interrupts are enabled.
|
||||
Calling it from hard-irq or irqs-off atomic contexts will result in
|
||||
hard-to-detect spurious enabling of interrupts.
|
||||
|
||||
wait_for_completion():
|
||||
|
||||
void wait_for_completion(struct completion *done):
|
||||
Calling it from IRQs-off atomic contexts will result in hard-to-detect
|
||||
spurious enabling of interrupts.
|
||||
|
||||
The default behavior is to wait without a timeout and to mark the task as
|
||||
uninterruptible. wait_for_completion() and its variants are only safe
|
||||
in process context (as they can sleep) but not in atomic context,
|
||||
interrupt context, with disabled irqs. or preemption is disabled - see also
|
||||
interrupt context, with IRQs disabled, or with preemption disabled - see also
|
||||
try_wait_for_completion() below for handling completion in atomic/interrupt
|
||||
context.
|
||||
|
||||
As all variants of wait_for_completion() can (obviously) block for a long
|
||||
time, you probably don't want to call this with held mutexes.
|
||||
time depending on the nature of the activity they are waiting for, in
|
||||
most cases you probably don't want to call this with held mutexes.
|
||||
|
||||
|
||||
Variants available:
|
||||
-------------------
|
||||
wait_for_completion*() variants available:
|
||||
------------------------------------------
|
||||
|
||||
The below variants all return status and this status should be checked in
|
||||
most(/all) cases - in cases where the status is deliberately not checked you
|
||||
@@ -148,51 +189,53 @@ probably want to make a note explaining this (e.g. see
|
||||
arch/arm/kernel/smp.c:__cpu_up()).
|
||||
|
||||
A common problem that occurs is to have unclean assignment of return types,
|
||||
so care should be taken with assigning return-values to variables of proper
|
||||
type. Checking for the specific meaning of return values also has been found
|
||||
to be quite inaccurate e.g. constructs like
|
||||
if (!wait_for_completion_interruptible_timeout(...)) would execute the same
|
||||
code path for successful completion and for the interrupted case - which is
|
||||
probably not what you want.
|
||||
so take care to assign return-values to variables of the proper type.
|
||||
|
||||
Checking for the specific meaning of return values also has been found
|
||||
to be quite inaccurate, e.g. constructs like:
|
||||
|
||||
if (!wait_for_completion_interruptible_timeout(...))
|
||||
|
||||
... would execute the same code path for successful completion and for the
|
||||
interrupted case - which is probably not what you want.
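
To make the three outcomes explicit, the return value can be classified before acting on it. This is a hedged, userspace-compilable sketch: 'enum wait_outcome', classify_wait() and naive_branch_taken() are hypothetical helpers, and ERESTARTSYS is defined locally because it is a kernel-internal value that userspace <errno.h> does not provide. The convention assumed is the one this document gives for wait_for_completion_interruptible_timeout(): negative on a signal, 0 on timeout, remaining jiffies (at least 1) on completion.

```c
/* Kernel-internal errno value, defined locally for this userspace sketch. */
#define ERESTARTSYS 512

enum wait_outcome {
	WAIT_TIMED_OUT,
	WAIT_INTERRUPTED,
	WAIT_COMPLETED,
};

/*
 * 'ret' must be a signed long: storing the result in an unsigned
 * variable would hide the negative error value.
 */
static enum wait_outcome classify_wait(long ret)
{
	if (ret < 0)
		return WAIT_INTERRUPTED;	/* -ERESTARTSYS: a signal arrived */
	if (ret == 0)
		return WAIT_TIMED_OUT;		/* no completion within the timeout */
	return WAIT_COMPLETED;			/* ret holds the remaining jiffies */
}

/* The criticized pattern: the branch is taken for the timeout case only. */
static int naive_branch_taken(long ret)
{
	return !ret;
}
```

With the naive test, a signal and a successful completion both fall through to the same path, which is exactly the confusion described above.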
|
||||
|
||||
int wait_for_completion_interruptible(struct completion *done)
|
||||
|
||||
This function marks the task TASK_INTERRUPTIBLE. If a signal was received
|
||||
while waiting it will return -ERESTARTSYS; 0 otherwise.
|
||||
This function marks the task TASK_INTERRUPTIBLE while it is waiting.
|
||||
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise.
|
||||
|
||||
unsigned long wait_for_completion_timeout(struct completion *done,
|
||||
unsigned long timeout)
|
||||
unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
|
||||
|
||||
The task is marked as TASK_UNINTERRUPTIBLE and will wait at most 'timeout'
|
||||
(in jiffies). If timeout occurs it returns 0 else the remaining time in
|
||||
jiffies (but at least 1). Timeouts are preferably calculated with
|
||||
msecs_to_jiffies() or usecs_to_jiffies(). If the returned timeout value is
|
||||
deliberately ignored a comment should probably explain why (e.g. see
|
||||
drivers/mfd/wm8350-core.c wm8350_read_auxadc())
|
||||
jiffies. If a timeout occurs it returns 0, else the remaining time in
|
||||
jiffies (but at least 1).
|
||||
|
||||
long wait_for_completion_interruptible_timeout(
|
||||
struct completion *done, unsigned long timeout)
|
||||
Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies(),
|
||||
to make the code largely HZ-invariant.
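
Why a hardcoded jiffies count is not HZ-invariant can be seen with a little arithmetic. The sketch below is a simplified model, not the kernel helper: model_msecs_to_jiffies() and model_jiffies_to_msecs() are hypothetical names, the tick rate is passed in as a parameter instead of coming from HZ, and the corner cases handled by the real msecs_to_jiffies() (such as very large values) are ignored.

```c
/* Simplified model: round milliseconds up to whole ticks at tick rate 'hz'. */
static unsigned long model_msecs_to_jiffies(unsigned int ms, unsigned int hz)
{
	return ((unsigned long)ms * hz + 999UL) / 1000UL;
}

/* Wall-clock duration, in milliseconds, of 'j' ticks at tick rate 'hz'. */
static unsigned int model_jiffies_to_msecs(unsigned long j, unsigned int hz)
{
	return (unsigned int)(j * 1000UL / hz);
}
```

In this model a literal timeout of 100 jiffies lasts 100 ms at a tick rate of 1000 but 400 ms at a tick rate of 250, whereas a timeout derived from 100 ms yields 100 and 25 ticks respectively and therefore the same wall-clock time.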
|
||||
|
||||
If the returned timeout value is deliberately ignored a comment should probably explain
|
||||
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc()).
|
||||
|
||||
long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
|
||||
|
||||
This function passes a timeout in jiffies and marks the task as
|
||||
TASK_INTERRUPTIBLE. If a signal was received it will return -ERESTARTSYS;
|
||||
otherwise it returns 0 if the completion timed out or the remaining time in
|
||||
otherwise it returns 0 if the completion timed out, or the remaining time in
|
||||
jiffies if completion occurred.
|
||||
|
||||
Further variants include _killable which uses TASK_KILLABLE as the
|
||||
designated task state and will return -ERESTARTSYS if it is interrupted or
|
||||
else 0 if completion was achieved. There is a _timeout variant as well:
|
||||
designated task state and will return -ERESTARTSYS if it is interrupted,
|
||||
or 0 if completion was achieved. There is a _timeout variant as well:
|
||||
|
||||
long wait_for_completion_killable(struct completion *done)
|
||||
long wait_for_completion_killable_timeout(struct completion *done,
|
||||
unsigned long timeout)
|
||||
long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
|
||||
|
||||
The _io variants wait_for_completion_io() behave the same as the non-_io
|
||||
variants, except for accounting waiting time as waiting on IO, which has
|
||||
an impact on how the task is accounted in scheduling stats.
|
||||
variants, except for accounting waiting time as 'waiting on IO', which has
|
||||
an impact on how the task is accounted in scheduling/IO stats:
|
||||
|
||||
void wait_for_completion_io(struct completion *done)
|
||||
unsigned long wait_for_completion_io_timeout(struct completion *done
|
||||
unsigned long timeout)
|
||||
unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
|
||||
|
||||
|
||||
Signaling completions:
|
||||
@@ -200,31 +243,31 @@ Signaling completions:
|
||||
|
||||
A thread that wants to signal that the conditions for continuation have been
|
||||
achieved calls complete() to signal exactly one of the waiters that it can
|
||||
continue.
|
||||
continue:
|
||||
|
||||
void complete(struct completion *done)
|
||||
|
||||
or calls complete_all() to signal all current and future waiters.
|
||||
... or calls complete_all() to signal all current and future waiters:
|
||||
|
||||
void complete_all(struct completion *done)
|
||||
|
||||
The signaling will work as expected even if completions are signaled before
|
||||
a thread starts waiting. This is achieved by the waiter "consuming"
|
||||
(decrementing) the done element of struct completion. Waiting threads
|
||||
(decrementing) the done field of 'struct completion'. Waiting threads
|
||||
wakeup order is the same as the order in which they were enqueued (FIFO order).
|
||||
|
||||
If complete() is called multiple times then this will allow for that number
|
||||
of waiters to continue - each call to complete() will simply increment the
|
||||
done element. Calling complete_all() multiple times is a bug though. Both
|
||||
complete() and complete_all() can be called in hard-irq/atomic context safely.
|
||||
done field. Calling complete_all() multiple times is a bug though. Both
|
||||
complete() and complete_all() can be called in IRQ/atomic context safely.
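
The counting behavior described above can be modeled without any threads. This is a single-threaded sketch with no locking; 'struct done_model' and the counter_*() functions are hypothetical names. Representing complete_all() by saturating the counter at UINT_MAX is an assumption of this model, chosen so that current and future waiters never exhaust it.

```c
#include <limits.h>
#include <stdbool.h>

/* Only the ->done counter is modeled; there is no waitqueue here. */
struct done_model {
	unsigned int done;
};

/* Each call allows one more waiter to continue. */
static void counter_complete(struct done_model *x)
{
	if (x->done != UINT_MAX)
		x->done++;
}

/* Release all current and future waiters (model assumption: saturate). */
static void counter_complete_all(struct done_model *x)
{
	x->done = UINT_MAX;
}

/*
 * What a waiter does when it looks at the counter: returns false if it
 * would have to block, otherwise consumes one posted completion.
 */
static bool counter_consume(struct done_model *x)
{
	if (!x->done)
		return false;
	if (x->done != UINT_MAX)
		x->done--;
	return true;
}
```

Two counter_complete() calls let exactly two counter_consume() calls succeed and the third fail, while after counter_complete_all() every consume succeeds until the object is re-initialized.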
|
||||
|
||||
There only can be one thread calling complete() or complete_all() on a
|
||||
particular struct completion at any time - serialized through the wait
|
||||
There can only be one thread calling complete() or complete_all() on a
|
||||
particular 'struct completion' at any time - serialized through the wait
|
||||
queue spinlock. Any such concurrent calls to complete() or complete_all()
|
||||
probably are a design bug.
|
||||
|
||||
Signaling completion from hard-irq context is fine as it will appropriately
|
||||
lock with spin_lock_irqsave/spin_unlock_irqrestore and it will never sleep.
|
||||
Signaling completion from IRQ context is fine as it will appropriately
|
||||
lock with spin_lock_irqsave()/spin_unlock_irqrestore() and it will never sleep.
|
||||
|
||||
|
||||
try_wait_for_completion()/completion_done():
|
||||
@@ -236,7 +279,7 @@ else it consumes one posted completion and returns true.
|
||||
|
||||
bool try_wait_for_completion(struct completion *done)
|
||||
|
||||
Finally, to check the state of a completion without changing it in any way,
|
||||
Finally, to check the state of a completion without changing it in any way,
|
||||
call completion_done(), which returns false if there are no posted
|
||||
completions that were not yet consumed by waiters (implying that there are
|
||||
waiters) and true otherwise:
|
||||
@@ -244,4 +287,4 @@ waiters) and true otherwise;
|
||||
bool completion_done(struct completion *done)
|
||||
|
||||
Both try_wait_for_completion() and completion_done() are safe to be called in
|
||||
hard-irq or atomic context.
|
||||
IRQ or atomic context.
|
||||
|