docs: move locking-specific documents to locking/

Several files under Documentation/*.txt describe some type of
locking API. Move them to locking/ subdir and add to the
locking/index.rst index file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/dd833a10bbd0b2c1461d78913f5ec28a7e27f00b.1588345503.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
Mauro Carvalho Chehab
2020-05-01 17:37:54 +02:00
committed by Jonathan Corbet
parent 9184027f0a
commit 95ca6d73a8
10 changed files with 11 additions and 4 deletions

Documentation/locking/futex-requeue-pi.rst (new file, 132 lines)

@@ -0,0 +1,132 @@
================
Futex Requeue PI
================
Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-design.rst]. For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".
Motivation
----------
Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.
Consider the simplified glibc calls::
    /* caller must lock mutex */
    pthread_cond_wait(cond, mutex)
    {
            lock(cond->__data.__lock);
            unlock(mutex);
            do {
                    unlock(cond->__data.__lock);
                    futex_wait(cond->__data.__futex);
                    lock(cond->__data.__lock);
            } while(...)
            unlock(cond->__data.__lock);
            lock(mutex);
    }

    pthread_cond_broadcast(cond)
    {
            lock(cond->__data.__lock);
            unlock(cond->__data.__lock);
            futex_requeue(cond->__data.__futex, cond->mutex);
    }
Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.
In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows::
    /* caller must lock mutex */
    pthread_cond_wait_pi(cond, mutex)
    {
            lock(cond->__data.__lock);
            unlock(mutex);
            do {
                    unlock(cond->__data.__lock);
                    futex_wait_requeue_pi(cond->__data.__futex);
                    lock(cond->__data.__lock);
            } while(...)
            unlock(cond->__data.__lock);
            /* the kernel acquired the mutex for us */
    }

    pthread_cond_broadcast_pi(cond)
    {
            lock(cond->__data.__lock);
            unlock(cond->__data.__lock);
            futex_requeue_pi(cond->__data.__futex, cond->mutex);
    }
The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().
Implementation
--------------
In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.
The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new system calls provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.
FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this system call is
still handled by futex_requeue (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that the
lock can be acquired at this stage as well; if so, the next waiter is
woken to finish the acquisition of the lock.
FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
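For illustration, here is a hedged sketch of how the two operations map
onto the raw futex system call; the wrapper names are invented for this
example, and real applications go through the C library rather than
calling these directly::

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>

    /* waiter side: block on the condvar futex; on success we return
     * already owning the PI futex */
    static long cond_wait_requeue_pi(unsigned int *cond_futex, unsigned int val,
                                     unsigned int *pi_futex)
    {
            return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
                           val, NULL, pi_futex, 0);
    }

    /* waker side: wake at most one waiter and requeue the rest onto the
     * PI futex; 'val' is the expected value of *cond_futex (the "CMP" part) */
    static long cond_broadcast_requeue_pi(unsigned int *cond_futex, unsigned int val,
                                          unsigned int *pi_futex)
    {
            return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
                           1 /* nr_wake */, (unsigned long)INT_MAX /* nr_requeue */,
                           pi_futex, val);
    }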

Documentation/locking/hwspinlock.rst (new file, 485 lines)

@@ -0,0 +1,485 @@
===========================
Hardware Spinlock Framework
===========================
Introduction
============
Hardware spinlock modules provide hardware assistance for synchronization
and mutual exclusion between heterogeneous processors and those not operating
under a single, shared operating system.
For example, OMAP4 has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP,
each of which is running a different Operating System (the master, A9,
is usually running Linux and the slave processors, the M3 and the DSP,
are running some flavor of RTOS).
A generic hwspinlock framework allows platform-independent drivers to use
the hwspinlock device in order to access data structures that are shared
between remote processors and that otherwise have no alternative
mechanism to accomplish synchronization and mutual exclusion operations.
This is necessary, for example, for Inter-processor communications:
on OMAP4, cpu-intensive multimedia tasks are offloaded by the host to the
remote M3 and/or C64x+ slave processors (by an IPC subsystem called Syslink).
To achieve fast message-based communications, a minimal kernel support
is needed to deliver messages arriving from a remote processor to the
appropriate user process.
This communication is based on a simple data structure that is shared
between the remote processors; access to it is synchronized using the
hwspinlock module (a remote processor directly places new messages in
this shared data structure).
A common hwspinlock interface makes it possible to have generic, platform-
independent, drivers.
User API
========
::
struct hwspinlock *hwspin_lock_request(void);
Dynamically assign an hwspinlock and return its address, or NULL
in case an unused hwspinlock isn't available. Users of this
API will usually want to communicate the lock's id to the remote core
before it can be used to achieve synchronization.
Should be called from a process context (might sleep).
::
struct hwspinlock *hwspin_lock_request_specific(unsigned int id);
Assign a specific hwspinlock id and return its address, or NULL
if that hwspinlock is already in use. Usually board code will
be calling this function in order to reserve specific hwspinlock
ids for predefined purposes.
Should be called from a process context (might sleep).
::
int of_hwspin_lock_get_id(struct device_node *np, int index);
Retrieve the global lock id for an OF phandle-based specific lock.
This function provides a means for DT users of a hwspinlock module
to get the global lock id of a specific hwspinlock, so that it can
be requested using the normal hwspin_lock_request_specific() API.
The function returns a lock id number on success, -EPROBE_DEFER if
the hwspinlock device is not yet registered with the core, or other
error values.
Should be called from a process context (might sleep).
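For example, a hedged sketch of a DT consumer, assuming 'dev' is the
driver's struct device and its node carries a hwspinlock phandle at
index 0::

    int id;
    struct hwspinlock *hwlock;

    id = of_hwspin_lock_get_id(dev->of_node, 0);
    if (id < 0)
            return id;      /* may be -EPROBE_DEFER */

    hwlock = hwspin_lock_request_specific(id);
    if (!hwlock)
            return -EBUSY;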
::
int hwspin_lock_free(struct hwspinlock *hwlock);
Free a previously-assigned hwspinlock; returns 0 on success, or an
appropriate error code on failure (e.g. -EINVAL if the hwspinlock
is already free).
Should be called from a process context (might sleep).
::
int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout);
Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled so
the caller must not sleep, and is advised to release the hwspinlock as
soon as possible, in order to minimize remote cores polling on the
hardware interconnect.
Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
::
int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout);
Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption and the local
interrupts are disabled, so the caller must not sleep, and is advised to
release the hwspinlock as soon as possible.
Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
::
int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to,
unsigned long *flags);
Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled,
local interrupts are disabled and their previous state is saved at the
given flags placeholder. The caller must not sleep, and is advised to
release the hwspinlock as soon as possible.
Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
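A brief usage sketch of this variant ('hwlock' was requested earlier;
the shared structure and its field are hypothetical)::

    unsigned long flags;
    int ret;

    /* spin for up to 100 msecs; on success, IRQ state is saved in flags */
    ret = hwspin_lock_timeout_irqsave(hwlock, 100, &flags);
    if (ret)
            return ret;     /* most likely -ETIMEDOUT */

    /* preemption and local interrupts are disabled: keep this short */
    shared_area->msg_count++;

    hwspin_unlock_irqrestore(hwlock, &flags);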
::
int hwspin_lock_timeout_raw(struct hwspinlock *hwlock, unsigned int timeout);
Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Caution: the caller must protect the routine that takes the hardware lock
with a mutex or spinlock of its own in order to avoid deadlock; in return,
the caller may perform time-consuming or even sleepable operations while
holding the hardware lock.
Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
::
int hwspin_lock_timeout_in_atomic(struct hwspinlock *hwlock, unsigned int to);
Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
This function shall be called only from an atomic context and the timeout
value shall not exceed a few msecs.
Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.
::
int hwspin_trylock(struct hwspinlock *hwlock);
Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.
Upon a successful return from this function, preemption is disabled so
caller must not sleep, and is advised to release the hwspinlock as soon as
possible, in order to minimize remote cores polling on the hardware
interconnect.
Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.
::
int hwspin_trylock_irq(struct hwspinlock *hwlock);
Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.
Upon a successful return from this function, preemption and the local
interrupts are disabled so caller must not sleep, and is advised to
release the hwspinlock as soon as possible.
Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.
::
int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags);
Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.
Upon a successful return from this function, preemption is disabled,
the local interrupts are disabled and their previous state is saved
at the given flags placeholder. The caller must not sleep, and is advised
to release the hwspinlock as soon as possible.
Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.
::
int hwspin_trylock_raw(struct hwspinlock *hwlock);
Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.
Caution: the caller must protect the routine that takes the hardware lock
with a mutex or spinlock of its own in order to avoid deadlock; in return,
the caller may perform time-consuming or even sleepable operations while
holding the hardware lock.
Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.
::
int hwspin_trylock_in_atomic(struct hwspinlock *hwlock);
Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.
This function shall be called only from an atomic context.
Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.
::
void hwspin_unlock(struct hwspinlock *hwlock);
Unlock a previously-locked hwspinlock. Always succeeds, and can be called
from any context (the function never sleeps).
.. note::

   code should **never** unlock an hwspinlock which is already unlocked
   (there is no protection against this).
::
void hwspin_unlock_irq(struct hwspinlock *hwlock);
Unlock a previously-locked hwspinlock and enable local interrupts.
The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption and local
interrupts are enabled. This function will never sleep.
::
void
hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags);
Unlock a previously-locked hwspinlock.
The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption is reenabled,
and the state of the local interrupts is restored to the state saved at
the given flags. This function will never sleep.
::
void hwspin_unlock_raw(struct hwspinlock *hwlock);
Unlock a previously-locked hwspinlock.
The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
This function will never sleep.
::
void hwspin_unlock_in_atomic(struct hwspinlock *hwlock);
Unlock a previously-locked hwspinlock.
The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
This function will never sleep.
::
int hwspin_lock_get_id(struct hwspinlock *hwlock);
Retrieve id number of a given hwspinlock. This is needed when an
hwspinlock is dynamically assigned: before it can be used to achieve
mutual exclusion with a remote cpu, the id number should be communicated
to the remote task with which we want to synchronize.
Returns the hwspinlock id number, or -EINVAL if hwlock is null.
Typical usage
=============
::
    #include <linux/hwspinlock.h>
    #include <linux/err.h>

    int hwspinlock_example1(void)
    {
            struct hwspinlock *hwlock;
            int ret;
            int id;

            /* dynamically assign a hwspinlock */
            hwlock = hwspin_lock_request();
            if (!hwlock)
                    ...

            id = hwspin_lock_get_id(hwlock);
            /* probably need to communicate id to a remote processor now */

            /* take the lock, spin for 1 sec if it's already taken */
            ret = hwspin_lock_timeout(hwlock, 1000);
            if (ret)
                    ...

            /*
             * we took the lock, do our thing now, but do NOT sleep
             */

            /* release the lock */
            hwspin_unlock(hwlock);

            /* free the lock */
            ret = hwspin_lock_free(hwlock);
            if (ret)
                    ...

            return ret;
    }

    int hwspinlock_example2(void)
    {
            struct hwspinlock *hwlock;
            int ret;

            /*
             * assign a specific hwspinlock id - this should be called early
             * by board init code.
             */
            hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID);
            if (!hwlock)
                    ...

            /* try to take it, but don't spin on it */
            ret = hwspin_trylock(hwlock);
            if (ret) {
                    pr_info("lock is already taken\n");
                    return -EBUSY;
            }

            /*
             * we took the lock, do our thing now, but do NOT sleep
             */

            /* release the lock */
            hwspin_unlock(hwlock);

            /* free the lock */
            ret = hwspin_lock_free(hwlock);
            if (ret)
                    ...

            return ret;
    }
API for implementors
====================
::
int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
const struct hwspinlock_ops *ops, int base_id, int num_locks);
To be called from the underlying platform-specific implementation, in
order to register a new hwspinlock device (which is usually a bank of
numerous locks). Should be called from a process context (this function
might sleep).
Returns 0 on success, or appropriate error code on failure.
::
int hwspin_lock_unregister(struct hwspinlock_device *bank);
To be called from the underlying vendor-specific implementation, in order
to unregister an hwspinlock device (which is usually a bank of numerous
locks).
Should be called from a process context (this function might sleep).
Returns 0 on success, or an appropriate error code on failure (e.g.
-EBUSY if the hwspinlock is still in use).
Important structs
=================
struct hwspinlock_device is a device which usually contains a bank
of hardware locks. It is registered by the underlying hwspinlock
implementation using the hwspin_lock_register() API.
::
    /**
     * struct hwspinlock_device - a device which usually spans numerous hwspinlocks
     * @dev: underlying device, will be used to invoke runtime PM api
     * @ops: platform-specific hwspinlock handlers
     * @base_id: id index of the first lock in this device
     * @num_locks: number of locks in this device
     * @lock: dynamically allocated array of 'struct hwspinlock'
     */
    struct hwspinlock_device {
            struct device *dev;
            const struct hwspinlock_ops *ops;
            int base_id;
            int num_locks;
            struct hwspinlock lock[0];
    };
struct hwspinlock_device contains an array of hwspinlock structs, each
of which represents a single hardware lock::
    /**
     * struct hwspinlock - this struct represents a single hwspinlock instance
     * @bank: the hwspinlock_device structure which owns this lock
     * @lock: initialized and used by hwspinlock core
     * @priv: private data, owned by the underlying platform-specific hwspinlock drv
     */
    struct hwspinlock {
            struct hwspinlock_device *bank;
            spinlock_t lock;
            void *priv;
    };
When registering a bank of locks, the hwspinlock driver only needs to
set the priv members of the locks. The rest of the members are set and
initialized by the hwspinlock core itself.
Implementation callbacks
========================
There are three possible callbacks defined in 'struct hwspinlock_ops'::
    struct hwspinlock_ops {
            int (*trylock)(struct hwspinlock *lock);
            void (*unlock)(struct hwspinlock *lock);
            void (*relax)(struct hwspinlock *lock);
    };
The first two callbacks are mandatory:
The ->trylock() callback should make a single attempt to take the lock, and
return 0 on failure and 1 on success. This callback may **not** sleep.
The ->unlock() callback releases the lock. It always succeeds, and it, too,
may **not** sleep.
The ->relax() callback is optional. It is called by hwspinlock core while
spinning on a lock, and can be used by the underlying implementation to force
a delay between two successive invocations of ->trylock(). It may **not** sleep.
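As a sketch only, a hypothetical vendor implementation of these callbacks
could look as follows; the register semantics (reading 0 means the lock
was free and is now taken, writing 0 releases it) and all names here are
assumptions for this example::

    #include <linux/hwspinlock.h>
    #include <linux/io.h>
    #include <linux/delay.h>

    /* a real driver also includes the core's hwspinlock_internal.h,
     * which defines struct hwspinlock and struct hwspinlock_ops */

    static int my_hwspinlock_trylock(struct hwspinlock *lock)
    {
            void __iomem *lock_reg = lock->priv;

            /* assumed hardware: reading 0 grabs the lock, 1 means taken */
            return readl(lock_reg) == 0;
    }

    static void my_hwspinlock_unlock(struct hwspinlock *lock)
    {
            void __iomem *lock_reg = lock->priv;

            /* assumed hardware: writing 0 releases the lock */
            writel(0, lock_reg);
    }

    static void my_hwspinlock_relax(struct hwspinlock *lock)
    {
            ndelay(50);     /* back off between ->trylock() attempts */
    }

    static const struct hwspinlock_ops my_hwspinlock_ops = {
            .trylock = my_hwspinlock_trylock,
            .unlock  = my_hwspinlock_unlock,
            .relax   = my_hwspinlock_relax,
    };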

Documentation/locking/index.rst

@@ -16,6 +16,13 @@ locking
     rt-mutex
     spinlocks
     ww-mutex-design
+    preempt-locking
+    pi-futex
+    futex-requeue-pi
+    hwspinlock
+    percpu-rw-semaphore
+    robust-futexes
+    robust-futex-ABI
 
 .. only:: subproject and html

Documentation/locking/percpu-rw-semaphore.rst (new file, 28 lines)

@@ -0,0 +1,28 @@
====================
Percpu rw semaphores
====================
Percpu rw semaphores are a new read-write semaphore design that is
optimized for locking for reading.
The problem with traditional read-write semaphores is that when multiple
cores take the lock for reading, the cache line containing the semaphore
is bouncing between L1 caches of the cores, causing performance
degradation.
Locking for reading is very fast: it uses RCU and avoids any atomic
instruction in the lock and unlock path. On the other hand, locking for
writing is very expensive: it calls synchronize_rcu(), which can take
hundreds of milliseconds.
The lock is declared with the "struct percpu_rw_semaphore" type.
The lock is initialized with percpu_init_rwsem(); it returns 0 on success
and -ENOMEM on allocation failure. The lock must be freed with
percpu_free_rwsem() to avoid a memory leak.
The lock is taken for reading with percpu_down_read() and released with
percpu_up_read(); for writing, use percpu_down_write() and percpu_up_write().
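A short usage sketch of the API described above; the functions wrapping
the semaphore are hypothetical::

    #include <linux/percpu-rwsem.h>

    static struct percpu_rw_semaphore big_lock;

    static int my_subsystem_init(void)
    {
            return percpu_init_rwsem(&big_lock);    /* 0 or -ENOMEM */
    }

    static void my_reader(void)
    {
            percpu_down_read(&big_lock);
            /* fast, RCU-based read-side critical section */
            percpu_up_read(&big_lock);
    }

    static void my_writer(void)
    {
            percpu_down_write(&big_lock);   /* expensive: may wait in synchronize_rcu() */
            /* exclusive critical section */
            percpu_up_write(&big_lock);
    }

    static void my_subsystem_exit(void)
    {
            percpu_free_rwsem(&big_lock);
    }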
The idea of using RCU for optimized rw-lock was introduced by
Eric Dumazet <eric.dumazet@gmail.com>.
The code was written by Mikulas Patocka <mpatocka@redhat.com>.

Documentation/locking/pi-futex.rst (new file, 122 lines)

@@ -0,0 +1,122 @@
======================
Lightweight PI-futexes
======================
We are calling them lightweight for 3 reasons:
- in the user-space fastpath a PI-enabled futex involves no kernel work
(or any other PI complexity) at all. No registration, no extra kernel
calls - just pure fast atomic ops in userspace.
- even in the slowpath, the system call and scheduling pattern is very
similar to normal futexes.
- the in-kernel PI implementation is streamlined around the mutex
abstraction, with strict rules that keep the implementation
relatively simple: only a single owner may own a lock (i.e. no
read-write lock support), only the owner may unlock a lock, no
recursive locking, etc.
Priority Inheritance - why?
---------------------------
The short reply: user-space PI helps achieve/improve determinism for
user-space applications. In the best case, it can help achieve
determinism and well-bounded latencies. Even in the worst case, PI will
improve the statistical distribution of locking-related application
delays.
The longer reply
----------------
Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see in the kernel [which is quite a complex program in itself],
lockless structures are rather the exception than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers the ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.
Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a highprio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.
So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.
Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different, there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:
Currently (without PI), if a high-prio and a low-prio task share a lock
[this is a quite common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could delay it indefinitely.
Implementation
--------------
As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.
To handle the slowpath, we have added two new futex ops:
- FUTEX_LOCK_PI
- FUTEX_UNLOCK_PI
If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The 'other' task is made the owner
of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
futex value. Then this task tries to lock the rt-mutex, on which it
blocks. Once it returns, it has the mutex acquired, and it sets the
futex value to its own TID and returns. Userspace has no other work to
perform - it now owns the lock, and the futex value contains
FUTEX_WAITERS|TID.
If the unlock side fastpath succeeds, [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.
If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.
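Putting fastpath and slowpath together, a hedged userspace sketch using
GCC atomic builtins; a real implementation lives in glibc and handles
errors, robustness and timed waits::

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>

    static void pi_lock(uint32_t *futex, uint32_t tid)
    {
            uint32_t zero = 0;

            /* fastpath: 0 -> TID transition, no kernel work at all */
            if (__atomic_compare_exchange_n(futex, &zero, tid, 0,
                                            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                    return;

            /* slowpath: kernel attaches pi_state and blocks us on the rt-mutex */
            syscall(SYS_futex, futex, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
    }

    static void pi_unlock(uint32_t *futex, uint32_t tid)
    {
            uint32_t expected = tid;

            /* fastpath: TID -> 0 transition; fails if FUTEX_WAITERS is set */
            if (__atomic_compare_exchange_n(futex, &expected, 0, 0,
                                            __ATOMIC_RELEASE, __ATOMIC_RELAXED))
                    return;

            /* slowpath: kernel unlocks the rt-mutex and wakes a waiter */
            syscall(SYS_futex, futex, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
    }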
Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex. [which is not quite
possible anyway, due to existing ABI properties of pthread mutexes.]
Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.
More details about priority inheritance can be found in
Documentation/locking/rt-mutex.rst.

Documentation/locking/preempt-locking.rst (new file, 144 lines)

@@ -0,0 +1,144 @@
===========================================================================
Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe
===========================================================================
:Author: Robert Love <rml@tech9.net>
Introduction
============
A preemptible kernel creates new locking issues. The issues are the same as
those under SMP: concurrency and reentrancy. Thankfully, the Linux preemptible
kernel model leverages existing SMP locking mechanisms. Thus, the kernel
requires explicit additional locking for very few additional situations.
This document is for all kernel hackers. Developing code in the kernel
requires protecting these situations.
RULE #1: Per-CPU data structures need explicit protection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Two similar problems arise. An example code snippet::
struct this_needs_locking tux[NR_CPUS];
tux[smp_processor_id()] = some_value;
/* task is preempted here... */
something = tux[smp_processor_id()];
First, since the data is per-CPU, it may not have explicit SMP locking, but
may nevertheless require it. Second, when a preempted task is finally
rescheduled, the previous value of smp_processor_id() may not equal the
current one. You must protect these situations by disabling preemption
around them. You can also use get_cpu() and put_cpu(), which disable and
re-enable preemption, as in the sketch below.
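For instance, the snippet above could be made preempt-safe like this
(reusing the hypothetical tux[] array)::

    int cpu;

    cpu = get_cpu();        /* disables preemption, returns this CPU's id */
    tux[cpu] = some_value;
    /* cannot be preempted here, so the CPU number stays valid */
    something = tux[cpu];
    put_cpu();              /* re-enables preemption */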
RULE #2: CPU state must be protected.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Under preemption, the state of the CPU must be protected. This is arch-
dependent, but includes CPU structures and state not preserved over a context
switch. For example, on x86, entering and exiting FPU mode is now a critical
section that must occur while preemption is disabled. Think what would happen
if the kernel is executing a floating-point instruction and is then preempted.
Remember, the kernel does not save FPU state except for user tasks. Therefore,
upon preemption, the FPU registers will be sold to the lowest bidder. Thus,
preemption must be disabled around such regions.
Note, some FPU functions are already explicitly preempt safe. For example,
kernel_fpu_begin and kernel_fpu_end will disable and enable preemption.
RULE #3: Lock acquire and release must be performed by same task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A lock acquired in one task must be released by the same task. This
means you can't do oddball things like acquire a lock and go off to
play while another task releases it. If you want to do something
like this, acquire and release the lock in the same code path and
have the caller wait on an event signaled by the other task.
Solution
========
Data protection under preemption is achieved by disabling preemption for the
duration of the critical region.
::
    preempt_enable()              decrement the preempt counter
    preempt_disable()             increment the preempt counter
    preempt_enable_no_resched()   decrement, but do not immediately preempt
    preempt_check_resched()       if needed, reschedule
    preempt_count()               return the preempt counter
The functions are nestable. In other words, you can call preempt_disable
n-times in a code path, and preemption will not be reenabled until the n-th
call to preempt_enable. The preempt statements compile to nothing if
preemption is not enabled.
Note that you do not need to explicitly prevent preemption if you are holding
any locks or interrupts are disabled, since preemption is implicitly disabled
in those cases.
But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
disabling preemption - any cond_resched() or cond_resched_lock() might trigger
a reschedule if the preempt count is 0. A simple printk() might trigger a
reschedule. So use this implicit preemption-disabling property only if you
know that the affected codepath does not do any of this. Best policy is to use
this only for small, atomic code that you wrote and which calls no complex
functions.
Example::
    cpucache_t *cc; /* this is per-CPU */
    preempt_disable();
    cc = cc_data(searchp);
    if (cc && cc->avail) {
            __free_block(searchp, cc_entry(cc), cc->avail);
            cc->avail = 0;
    }
    preempt_enable();
    return 0;
Notice how the preemption statements must encompass every reference of the
critical variables. Another example::
    int buf[NR_CPUS];
    set_cpu_val(buf);
    if (buf[smp_processor_id()] == -1) printk(KERN_INFO "wee!\n");
    spin_lock(&buf_lock);
    /* ... */
This code is not preempt-safe, but see how easily we can fix it by simply
moving the spin_lock up two lines.
Preventing preemption using interrupt disabling
===============================================
It is possible to prevent a preemption event using local_irq_disable and
local_irq_save. Note, when doing so, you must be very careful to not cause
an event that would set need_resched and result in a preemption check. When
in doubt, rely on locking or explicit preemption disabling.
Note that since 2.5, interrupt disabling is only per-CPU (i.e. local).
An additional concern is proper usage of local_irq_disable and local_irq_save.
These may be used to protect from preemption, however, on exit, if preemption
may be enabled, a test to see if preemption is required should be done. If
these are called from the spin_lock and read/write lock macros, the right thing
is done. They may also be called within a spin-lock protected region, however,
if they are ever called outside of this context, a test for preemption should
be made. Do note that calls from interrupt context or bottom halves/tasklets
are also protected by preemption locks and so may use the versions which do
not check preemption.

Documentation/locking/robust-futex-ABI.rst (new file, 184 lines)

@@ -0,0 +1,184 @@
====================
The robust futex ABI
====================
:Author: Started by Paul Jackson <pj@sgi.com>
Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.
The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention. The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:
1) a one time call, per thread, to tell the kernel where its list of
held robust_futexes begins, and
2) internal kernel code at exit, to handle any listed locks held
by the exiting thread.
The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel. Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.
For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them. If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.
A thread that anticipates possibly using robust_futexes should first
issue the system call::
asmlinkage long
sys_set_robust_list(struct robust_list_head __user *head, size_t len);
The pointer 'head' points to a structure in the thread's address space
consisting of three words. Each word is 32 bits on 32-bit architectures,
or 64 bits on 64-bit architectures, and in local byte order. Each thread
should have its own thread-private 'head'.
If a thread is running in 32-bit compatibility mode on a native 64-bit
kernel, then it can actually have two such structures - one using 32-bit
words for 32-bit compatibility mode, and one using 64-bit words for
64-bit native mode. The kernel, if it is a 64-bit kernel supporting
32-bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to set up that list.
The first word in the memory structure at 'head' contains a
pointer to a single linked list of 'lock entries', one per lock,
as described below. If the list is empty, the pointer will point
to itself, 'head'. The last 'lock entry' points back to the 'head'.
The second word, called 'offset', specifies the offset, plus or minus,
from the address of each 'lock entry' to its associated 'lock word'.
The 'lock
is always a 32 bit word, unlike the other words above. The 'lock
word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
of the thread holding the lock in the bottom 30 bits. See further
below for a description of the flag bits.
The third word, called 'list_op_pending', contains a transient copy of
the address of the 'lock entry', during list insertion and removal,
and is needed to correctly resolve races should a thread exit while
in the middle of a locking or unlocking operation.
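These three words correspond to the following declarations from the
uapi <linux/futex.h> header::

    struct robust_list {
            struct robust_list __user *next;
    };

    struct robust_list_head {
            struct robust_list list;
            long futex_offset;
            struct robust_list __user *list_op_pending;
    };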
Each 'lock entry' on the single linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries. In addition, nearby to each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.
The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes. The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.
For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'. Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wakeup the next thread waiting for
that lock using the futex mechanism.
When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task. The task may retrieve that value later on by
using the system call::
asmlinkage long
sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
size_t __user *len_ptr);
It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock. The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread. The thread should link those locks
it currently holds using the 'lock entry' pointers. It may also have
other links between the locks, such as the reverse side of a double
linked list, but that doesn't matter to the kernel.
By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.
Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wakeup, locks. The kernels
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.
There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time. Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.
A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region. The
thread currently holding such a lock, if any, is marked with the thread's
TID in the lower 30 bits of the 'lock word'.
When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal (a C sketch follows the
two lists below):
On insertion:
1) set the 'list_op_pending' word to the address of the 'lock entry'
to be inserted,
2) acquire the futex lock,
3) add the lock entry, with its thread id (TID) in the bottom 30 bits
of the 'lock word', to the linked list starting at 'head', and
4) clear the 'list_op_pending' word.
On removal:
1) set the 'list_op_pending' word to the address of the 'lock entry'
to be removed,
2) remove the lock entry for this lock from the 'head' list,
3) release the futex lock, and
4) clear the 'list_op_pending' word.
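A minimal C sketch of the insertion side, assuming the uapi structures
shown earlier, a hypothetical struct my_lock layout and an invented
futex_lock() helper; memory-ordering details are elided::

    struct my_lock {
            struct robust_list entry;       /* the 'lock entry' */
            unsigned int lock_word;         /* the 32-bit 'lock word' */
    };

    /* head.futex_offset was set at registration time to:
     * offsetof(struct my_lock, lock_word) - offsetof(struct my_lock, entry) */

    static void robust_lock(struct robust_list_head *head,
                            struct my_lock *l, unsigned int tid)
    {
            head->list_op_pending = &l->entry;      /* 1) announce the operation */
            futex_lock(&l->lock_word, tid);         /* 2) acquire; TID in low 30 bits */
            l->entry.next = head->list.next;        /* 3) link into the 'head' list */
            head->list.next = &l->entry;
            head->list_op_pending = NULL;           /* 4) operation complete */
    }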
On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock word' found by walking
the list starting at 'head'. For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:
1) if bit 31 (0x80000000) is set in that word, then attempt a futex
wakeup on that address, which will wake the next thread that
used the futex mechanism to wait on that address, and
2) atomically set bit 30 (0x40000000) in the 'lock word'.
In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.
The kernel exit code will silently stop scanning the list further if at
any point:
1) the 'head' pointer or a subsequent linked list pointer
is not a valid address of a user space word
2) the calculated location of the 'lock word' (address plus
'offset') is not the valid address of a 32 bit user space
word
3) the list contains more than 1 million (subject to
future kernel configuration changes) elements.
When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.

Documentation/locking/robust-futexes.rst (new file, 221 lines)

@@ -0,0 +1,221 @@
========================================
A description of what robust futexes are
========================================
:Started by: Ingo Molnar <mingo@redhat.com>
Background
----------
What are robust futexes? To answer that, we first need to understand
what futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having
to enter the kernel.
A futex is in essence a user-space address, e.g. a 32-bit lock variable
field. If userspace notices contention (the lock is already owned and
someone else wants to grab it too) then the lock is marked with a value
that says "there's a waiter pending", and the sys_futex(FUTEX_WAIT)
syscall is used to wait for the other guy to release it. The kernel
creates a 'futex queue' internally, so that it can later on match up the
waiter with the waker - without them having to know about each other.
When the owner thread releases the futex, it notices (via the variable
value) that there were waiter(s) pending, and does the
sys_futex(FUTEX_WAKE) syscall to wake them up. Once all waiters have
taken and released the lock, the futex is again back to 'uncontended'
state, and there's no in-kernel state associated with it. The kernel
completely forgets that there ever was a futex at that address. This
method makes futexes very lightweight and scalable.
"Robustness" is about dealing with crashes while holding a lock: if a
process exits prematurely while holding a pthread_mutex_t lock that is
also shared with some other process (e.g. yum segfaults while holding a
pthread_mutex_t, or yum is kill -9-ed), then waiters for that lock need
to be notified that the last owner of the lock exited in some irregular
way.
To solve such types of problems, "robust mutex" userspace APIs were
created: pthread_mutex_lock() returns an error value if the owner exits
prematurely - and the new owner can decide whether the data protected by
the lock can be recovered safely.
There is a big conceptual problem with futex based mutexes though: it is
the kernel that destroys the owner task (e.g. due to a SEGFAULT), but
the kernel cannot help with the cleanup: if there is no 'futex queue'
(and in most cases there is none, futexes being fast lightweight locks)
then the kernel has no information to clean up after the held lock!
Userspace has no chance to clean up after the lock either - userspace is
the one that crashes, so it has no opportunity to clean up. Catch-22.
In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot
is needed to release that futex based lock. This is one of the leading
bugreports against yum.
To solve this problem, the traditional approach was to extend the vma
(virtual memory area descriptor) concept to have a notion of 'pending
robust futexes attached to this area'. This approach requires 3 new
syscall variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and
FUTEX_RECOVER. At do_exit() time, all vmas are searched to see whether
they have a robust_head set. This approach has two fundamental problems
left:
- it has quite complex locking and race scenarios. The vma-based
approach had been pending for years, but it is still not completely
reliable.
- it has to scan _every_ vma at sys_exit() time, per thread!
The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas
every pthread_exit() takes a millisecond or more, also totally
destroying the CPU's L1 and L2 caches!
This is very much noticeable even for normal process sys_exit_group()
calls: the kernel has to do the vma scanning unconditionally! (this is
because the kernel has no knowledge about how many robust futexes there
are to be cleaned up, because a robust futex might have been registered
in another task, and the futex variable might have been simply mmap()-ed
into this process's address space).
This huge overhead forced the creation of CONFIG_FUTEX_ROBUST so that
normal kernels can turn it off, but worse than that: the overhead makes
robust futexes impractical for any type of generic Linux distribution.
So something had to be done.
New approach to robust futexes
------------------------------
At the heart of this new approach there is a per-thread private list of
robust locks that userspace is holding (maintained by glibc) - which
userspace list is registered with the kernel via a new syscall [this
registration happens at most once per thread lifetime]. At do_exit()
time, the kernel checks this user-space list: are there any robust futex
locks to be cleaned up?
In the common case, at do_exit() time, there is no list registered, so
the cost of robust futexes is just a simple current->robust_list != NULL
comparison. If the thread has registered a list, then normally the list
is empty. If the thread/process crashed or terminated in some incorrect
way then the list might be non-empty: in this case the kernel carefully
walks the list [not trusting it], and marks all locks that are owned by
this thread with the FUTEX_OWNER_DIED bit, and wakes up one waiter (if
any).
The list is guaranteed to be private and per-thread at do_exit() time,
so it can be accessed by the kernel in a lockless way.
There is one race possible though: since adding to and removing from the
list is done after the futex is acquired by glibc, there is a few
instructions window for the thread (or process) to die there, leaving
the futex hung. To protect against this possibility, userspace (glibc)
also maintains a simple per-thread 'list_op_pending' field, to allow the
kernel to clean up if the thread dies after acquiring the lock, but just
before it could have added itself to the list. Glibc sets this
list_op_pending field before it tries to acquire the futex, and clears
it after the list-add (or list-remove) has finished.
That's all that is needed - all the rest of robust-futex cleanup is done
in userspace [just like with the previous patches].
Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes.
Key differences of this userspace-list based approach, compared to the
vma based method:
- it's much, much faster: at thread exit time, there's no need to loop
over every vma (!), which the VM-based method has to do. Only a very
simple 'is the list empty' op is done.
- no VM changes are needed - 'struct address_space' is left alone.
- no registration of individual locks is needed: robust mutexes don't
need any extra per-lock syscalls. Robust mutexes thus become a very
lightweight primitive - so they don't force the application designer
to do a hard choice between performance and robustness - robust
mutexes are just as fast.
- no per-lock kernel allocation happens.
- no resource limits are needed.
- no kernel-space recovery call (FUTEX_RECOVER) is needed.
- the implementation and the locking is "obvious", and there are no
interactions with the VM.
Performance
-----------
I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:
- with FUTEX_WAIT set [contended mutex]: 130 msecs
- without FUTEX_WAIT set [uncontended mutex]: 30 msecs
I have also measured an approach where glibc does the lock notification
[which it currently does for !pshared robust mutexes], and that took 256
msecs - clearly slower, due to the 1 million FUTEX_WAKE syscalls
userspace had to do.
(1 million held locks are unheard of - we expect at most a handful of
locks to be held at a time. Nevertheless it's nice to know that this
approach scales nicely.)
Implementation details
----------------------
The patch adds two new syscalls: one to register the userspace list, and
one to query the registered list pointer::
asmlinkage long
sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
asmlinkage long
sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
size_t __user *len_ptr);
List registration is very fast: the pointer is simply stored in
current->robust_list. [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head
for new threads, without the need of another syscall.]
So there is virtually zero overhead for tasks not using robust futexes,
and even for robust futex users, there is only one extra syscall per
thread lifetime, and the cleanup operation, if it happens, is fast and
straightforward. The kernel doesn't have any internal distinction between
robust and normal futexes.
If a futex is found to be held at exit time, the kernel sets the
following bit of the futex word::
#define FUTEX_OWNER_DIED 0x40000000
and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.
Otherwise, robust futexes are acquired by glibc by putting the TID into
the futex field atomically. Waiters set the FUTEX_WAITERS bit::
#define FUTEX_WAITERS 0x80000000
and the remaining bits are for the TID.
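A small sketch of how a new owner can inspect the word it has just
acquired; FUTEX_TID_MASK is the companion constant from <linux/futex.h>,
and the recovery policy itself is up to the application (glibc reports
this case as EOWNERDEAD)::

    #include <linux/futex.h>
    #include <stdint.h>

    /* did the previous owner die while holding this robust futex? */
    static int previous_owner_died(uint32_t val)
    {
            return (val & FUTEX_OWNER_DIED) != 0;   /* set by the kernel */
    }

    static uint32_t owner_tid(uint32_t val)
    {
            return val & FUTEX_TID_MASK;            /* low 30 bits */
    }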
Testing, architecture support
-----------------------------
I've tested the new syscalls on x86 and x86_64, and have made sure the
parsing of the userspace list is robust [ ;-) ] even if the list is
deliberately corrupted.
i386 and x86_64 syscalls are wired up at the moment, and Ulrich has
tested the new glibc code (on x86_64 and i386), and it works for his
robust-mutex testcases.
All other architectures should build just fine too - but they won't have
the new syscalls yet.
Architectures need to implement the new futex_atomic_cmpxchg_inatomic()
inline function before wiring up the syscalls.

Documentation/locking/rt-mutex.rst

@@ -4,7 +4,7 @@ RT-mutex subsystem with PI support
 
 RT-mutexes with priority inheritance are used to support PI-futexes,
 which enable pthread_mutex_t priority inheritance attributes
-(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
+(PTHREAD_PRIO_INHERIT). [See Documentation/locking/pi-futex.rst for more details
 about PI-futexes.]
 
 This technology was developed in the -rt tree and streamlined for