docs: move locking-specific documents to locking/
Several files under Documentation/*.txt describe some type of locking API. Move them to the locking/ subdir and add them to the locking/index.rst index file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/dd833a10bbd0b2c1461d78913f5ec28a7e27f00b.1588345503.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Committed by: Jonathan Corbet
Parent: 9184027f0a
Commit: 95ca6d73a8

Documentation/locking/futex-requeue-pi.rst | 132 (new file)

@@ -0,0 +1,132 @@
================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-design.rst]. For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

	/* caller must lock mutex */
	pthread_cond_wait(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
		   unlock(cond->__data.__lock);
		   futex_wait(cond->__data.__futex);
		   lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		lock(mutex);
	}

	pthread_cond_broadcast(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue(cond->data.__futex, cond->mutex);
	}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows::


	/* caller must lock mutex */
	pthread_cond_wait_pi(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
		   unlock(cond->__data.__lock);
		   futex_wait_requeue_pi(cond->__data.__futex);
		   lock(cond->__data.__lock);
		} while(...)
		unlock(cond->__data.__lock);
		/* the kernel acquired the mutex for us */
	}

	pthread_cond_broadcast_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue_pi(cond->data.__futex, cond->mutex);
	}

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new system calls provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this system call is
still handled by futex_requeue (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well, if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
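
A minimal user-space sketch of the raw syscall usage described above
(editor's illustration, not part of this patch; the wrapper functions and
their names are invented, following the futex(2) argument layout)::

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>

	/* Wait on the non-PI condvar futex, asking to be requeued to the
	 * PI mutex futex.  'val' is the expected value of *cond_futex. */
	static long futex_wait_requeue_pi(uint32_t *cond_futex, uint32_t val,
					  uint32_t *pi_mutex_futex)
	{
		return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
			       val, NULL, pi_mutex_futex, 0);
	}

	/* Wake at most one waiter and requeue the rest onto the PI futex.
	 * nr_wake must be 1; 'val' is the expected value of *cond_futex. */
	static long futex_cmp_requeue_pi(uint32_t *cond_futex, uint32_t val,
					 uint32_t *pi_mutex_futex, int nr_requeue)
	{
		return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
			       1 /* nr_wake */, (void *)(long)nr_requeue,
			       pi_mutex_futex, val);
	}
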

Documentation/locking/hwspinlock.rst | 485 (new file)

@@ -0,0 +1,485 @@
===========================
Hardware Spinlock Framework
===========================

Introduction
============

Hardware spinlock modules provide hardware assistance for synchronization
and mutual exclusion between heterogeneous processors and those not operating
under a single, shared operating system.

For example, OMAP4 has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP,
each of which is running a different Operating System (the master, A9,
is usually running Linux and the slave processors, the M3 and the DSP,
are running some flavor of RTOS).

A generic hwspinlock framework allows platform-independent drivers to use
the hwspinlock device in order to access data structures that are shared
between remote processors, that otherwise have no alternative mechanism
to accomplish synchronization and mutual exclusion operations.

This is necessary, for example, for Inter-processor communications:
on OMAP4, cpu-intensive multimedia tasks are offloaded by the host to the
remote M3 and/or C64x+ slave processors (by an IPC subsystem called Syslink).

To achieve fast message-based communications, a minimal kernel support
is needed to deliver messages arriving from a remote processor to the
appropriate user process.

This communication is based on simple data structures that are shared between
the remote processors, and access to them is synchronized using the hwspinlock
module (the remote processor directly places new messages in this shared data
structure).

A common hwspinlock interface makes it possible to have generic,
platform-independent drivers.

User API
========

::

  struct hwspinlock *hwspin_lock_request(void);

Dynamically assign an hwspinlock and return its address, or NULL
in case an unused hwspinlock isn't available. Users of this
API will usually want to communicate the lock's id to the remote core
before it can be used to achieve synchronization.

Should be called from a process context (might sleep).

::

  struct hwspinlock *hwspin_lock_request_specific(unsigned int id);

Assign a specific hwspinlock id and return its address, or NULL
if that hwspinlock is already in use. Usually board code will
be calling this function in order to reserve specific hwspinlock
ids for predefined purposes.

Should be called from a process context (might sleep).

::

  int of_hwspin_lock_get_id(struct device_node *np, int index);

Retrieve the global lock id for an OF phandle-based specific lock.
This function provides a means for DT users of a hwspinlock module
to get the global lock id of a specific hwspinlock, so that it can
be requested using the normal hwspin_lock_request_specific() API.

The function returns a lock id number on success, -EPROBE_DEFER if
the hwspinlock device is not yet registered with the core, or other
error values.

Should be called from a process context (might sleep).

::

  int hwspin_lock_free(struct hwspinlock *hwlock);

Free a previously-assigned hwspinlock; returns 0 on success, or an
appropriate error code on failure (e.g. -EINVAL if the hwspinlock
is already free).

Should be called from a process context (might sleep).

::

  int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled so
the caller must not sleep, and is advised to release the hwspinlock as
soon as possible, in order to minimize remote cores polling on the
hardware interconnect.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.

::

  int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption and the local
interrupts are disabled, so the caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).
The function will never sleep.

::

  int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to,
                                  unsigned long *flags);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.
Upon a successful return from this function, preemption is disabled,
local interrupts are disabled and their previous state is saved at the
given flags placeholder. The caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).

The function will never sleep.

::

  int hwspin_lock_timeout_raw(struct hwspinlock *hwlock, unsigned int timeout);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.

Caution: the caller must protect the acquisition of the hardware lock with
a mutex or a spinlock to avoid deadlock; doing so allows the caller to
perform time-consuming or sleepable operations while holding the hardware
lock.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).

The function will never sleep.

::

  int hwspin_lock_timeout_in_atomic(struct hwspinlock *hwlock, unsigned int to);

Lock a previously-assigned hwspinlock with a timeout limit (specified in
msecs). If the hwspinlock is already taken, the function will busy loop
waiting for it to be released, but give up when the timeout elapses.

This function shall be called only from an atomic context and the timeout
value shall not exceed a few msecs.

Returns 0 when successful and an appropriate error code otherwise (most
notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs).

The function will never sleep.

::

  int hwspin_trylock(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption is disabled so
caller must not sleep, and is advised to release the hwspinlock as soon as
possible, in order to minimize remote cores polling on the hardware
interconnect.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  int hwspin_trylock_irq(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption and the local
interrupts are disabled so caller must not sleep, and is advised to
release the hwspinlock as soon as possible.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).

The function will never sleep.

::

  int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Upon a successful return from this function, preemption is disabled,
the local interrupts are disabled and their previous state is saved
at the given flags placeholder. The caller must not sleep, and is advised
to release the hwspinlock as soon as possible.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  int hwspin_trylock_raw(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

Caution: the caller must protect the acquisition of the hardware lock with
a mutex or a spinlock to avoid deadlock; doing so allows the caller to
perform time-consuming or sleepable operations while holding the hardware
lock.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  int hwspin_trylock_in_atomic(struct hwspinlock *hwlock);

Attempt to lock a previously-assigned hwspinlock, but immediately fail if
it is already taken.

This function shall be called only from an atomic context.

Returns 0 on success and an appropriate error code otherwise (most
notably -EBUSY if the hwspinlock was already taken).
The function will never sleep.

::

  void hwspin_unlock(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock. Always succeeds, and can be called
from any context (the function never sleeps).

.. note::

  code should **never** unlock an hwspinlock which is already unlocked
  (there is no protection against this).

::

  void hwspin_unlock_irq(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock and enable local interrupts.
The caller should **never** unlock an hwspinlock which is already unlocked.

Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption and local
interrupts are enabled. This function will never sleep.

::

  void
  hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags);

Unlock a previously-locked hwspinlock.

The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
Upon a successful return from this function, preemption is reenabled,
and the state of the local interrupts is restored to the state saved at
the given flags. This function will never sleep.

::

  void hwspin_unlock_raw(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock.

The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
This function will never sleep.

::

  void hwspin_unlock_in_atomic(struct hwspinlock *hwlock);

Unlock a previously-locked hwspinlock.

The caller should **never** unlock an hwspinlock which is already unlocked.
Doing so is considered a bug (there is no protection against this).
This function will never sleep.

::

  int hwspin_lock_get_id(struct hwspinlock *hwlock);

Retrieve id number of a given hwspinlock. This is needed when an
hwspinlock is dynamically assigned: before it can be used to achieve
mutual exclusion with a remote cpu, the id number should be communicated
to the remote task with which we want to synchronize.

Returns the hwspinlock id number, or -EINVAL if hwlock is null.

Typical usage
=============

::

	#include <linux/hwspinlock.h>
	#include <linux/err.h>

	int hwspinlock_example1(void)
	{
		struct hwspinlock *hwlock;
		int id;
		int ret;

		/* dynamically assign a hwspinlock */
		hwlock = hwspin_lock_request();
		if (!hwlock)
			...

		id = hwspin_lock_get_id(hwlock);
		/* probably need to communicate id to a remote processor now */

		/* take the lock, spin for 1 sec if it's already taken */
		ret = hwspin_lock_timeout(hwlock, 1000);
		if (ret)
			...

		/*
		 * we took the lock, do our thing now, but do NOT sleep
		 */

		/* release the lock */
		hwspin_unlock(hwlock);

		/* free the lock */
		ret = hwspin_lock_free(hwlock);
		if (ret)
			...

		return ret;
	}

	int hwspinlock_example2(void)
	{
		struct hwspinlock *hwlock;
		int ret;

		/*
		 * assign a specific hwspinlock id - this should be called early
		 * by board init code.
		 */
		hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID);
		if (!hwlock)
			...

		/* try to take it, but don't spin on it */
		ret = hwspin_trylock(hwlock);
		if (ret) {
			pr_info("lock is already taken\n");
			return -EBUSY;
		}

		/*
		 * we took the lock, do our thing now, but do NOT sleep
		 */

		/* release the lock */
		hwspin_unlock(hwlock);

		/* free the lock */
		ret = hwspin_lock_free(hwlock);
		if (ret)
			...

		return ret;
	}

API for implementors
====================

::

  int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
		const struct hwspinlock_ops *ops, int base_id, int num_locks);

To be called from the underlying platform-specific implementation, in
order to register a new hwspinlock device (which is usually a bank of
numerous locks). Should be called from a process context (this function
might sleep).

Returns 0 on success, or appropriate error code on failure.

::

  int hwspin_lock_unregister(struct hwspinlock_device *bank);

To be called from the underlying vendor-specific implementation, in order
to unregister an hwspinlock device (which is usually a bank of numerous
locks).

Should be called from a process context (this function might sleep).

Returns 0 on success, or an appropriate error code on failure (e.g.
if the hwspinlock device is still in use).

Important structs
=================

struct hwspinlock_device is a device which usually contains a bank
of hardware locks. It is registered by the underlying hwspinlock
implementation using the hwspin_lock_register() API.

::

	/**
	 * struct hwspinlock_device - a device which usually spans numerous hwspinlocks
	 * @dev: underlying device, will be used to invoke runtime PM api
	 * @ops: platform-specific hwspinlock handlers
	 * @base_id: id index of the first lock in this device
	 * @num_locks: number of locks in this device
	 * @lock: dynamically allocated array of 'struct hwspinlock'
	 */
	struct hwspinlock_device {
		struct device *dev;
		const struct hwspinlock_ops *ops;
		int base_id;
		int num_locks;
		struct hwspinlock lock[0];
	};

struct hwspinlock_device contains an array of hwspinlock structs, each
of which represents a single hardware lock::

	/**
	 * struct hwspinlock - this struct represents a single hwspinlock instance
	 * @bank: the hwspinlock_device structure which owns this lock
	 * @lock: initialized and used by hwspinlock core
	 * @priv: private data, owned by the underlying platform-specific hwspinlock drv
	 */
	struct hwspinlock {
		struct hwspinlock_device *bank;
		spinlock_t lock;
		void *priv;
	};

When registering a bank of locks, the hwspinlock driver only needs to
set the priv members of the locks. The rest of the members are set and
initialized by the hwspinlock core itself.

Implementation callbacks
========================

There are three possible callbacks defined in 'struct hwspinlock_ops'::

	struct hwspinlock_ops {
		int (*trylock)(struct hwspinlock *lock);
		void (*unlock)(struct hwspinlock *lock);
		void (*relax)(struct hwspinlock *lock);
	};

The first two callbacks are mandatory:

The ->trylock() callback should make a single attempt to take the lock, and
return 0 on failure and 1 on success. This callback may **not** sleep.

The ->unlock() callback releases the lock. It always succeeds, and it, too,
may **not** sleep.

The ->relax() callback is optional. It is called by hwspinlock core while
spinning on a lock, and can be used by the underlying implementation to force
a delay between two successive invocations of ->trylock(). It may **not** sleep.
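
As an illustration of the callbacks above, here is a minimal sketch of a
hypothetical vendor driver (the register layout and the "acme" identifiers
are invented for the example; only the hwspinlock core types and calls are
real)::

	#include <linux/hwspinlock.h>
	#include <linux/io.h>

	/*
	 * Hypothetical hardware: one 32-bit register per lock; a read
	 * returns 0 when the lock was acquired and 1 when it is busy,
	 * and writing 0 releases the lock.
	 */
	static int acme_hwspinlock_trylock(struct hwspinlock *lock)
	{
		void __iomem *reg = lock->priv;

		/* ->trylock() must return 1 on success, 0 on failure */
		return readl(reg) == 0;
	}

	static void acme_hwspinlock_unlock(struct hwspinlock *lock)
	{
		void __iomem *reg = lock->priv;

		writel(0, reg);
	}

	static const struct hwspinlock_ops acme_hwspinlock_ops = {
		.trylock = acme_hwspinlock_trylock,
		.unlock  = acme_hwspinlock_unlock,
		/* .relax is optional and omitted here */
	};

The bank would then be registered from the driver's probe routine with
hwspin_lock_register(bank, dev, &acme_hwspinlock_ops, base_id, num_locks),
after setting the priv member of each lock to its register address.
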
Documentation/locking/index.rst

@@ -16,6 +16,13 @@ locking
     rt-mutex
     spinlocks
     ww-mutex-design
+    preempt-locking
+    pi-futex
+    futex-requeue-pi
+    hwspinlock
+    percpu-rw-semaphore
+    robust-futexes
+    robust-futex-ABI
 
 .. only:: subproject and html
 

Documentation/locking/percpu-rw-semaphore.rst | 28 (new file)

@@ -0,0 +1,28 @@
====================
Percpu rw semaphores
====================

Percpu rw semaphores are a new read-write semaphore design that is
optimized for locking for reading.

The problem with traditional read-write semaphores is that when multiple
cores take the lock for reading, the cache line containing the semaphore
is bouncing between L1 caches of the cores, causing performance
degradation.

Locking for reading is very fast: it uses RCU and it avoids any atomic
instruction in the lock and unlock path. On the other hand, locking for
writing is very expensive: it calls synchronize_rcu(), which can take
hundreds of milliseconds.

The lock is declared with the "struct percpu_rw_semaphore" type.
The lock is initialized with percpu_init_rwsem, which returns 0 on success
and -ENOMEM on allocation failure.
The lock must be freed with percpu_free_rwsem to avoid a memory leak.

The lock is locked for read with percpu_down_read, percpu_up_read and
for write with percpu_down_write, percpu_up_write.

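A minimal usage sketch (the surrounding structure and function names are
made up for illustration; only the percpu_rw_semaphore calls themselves are
the real API)::

	#include <linux/percpu-rwsem.h>

	static struct percpu_rw_semaphore example_sem;

	static int example_init(void)
	{
		/* returns 0 on success, -ENOMEM on allocation failure */
		return percpu_init_rwsem(&example_sem);
	}

	static void example_read_side(void)
	{
		percpu_down_read(&example_sem);
		/* read-side critical section: fast, scales across CPUs */
		percpu_up_read(&example_sem);
	}

	static void example_write_side(void)
	{
		percpu_down_write(&example_sem);
		/* write-side critical section: expensive, excludes all readers */
		percpu_up_write(&example_sem);
	}

	static void example_exit(void)
	{
		percpu_free_rwsem(&example_sem);
	}
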
The idea of using RCU for optimized rw-lock was introduced by
Eric Dumazet <eric.dumazet@gmail.com>.
The code was written by Mikulas Patocka <mpatocka@redhat.com>

Documentation/locking/pi-futex.rst | 122 (new file)

@@ -0,0 +1,122 @@
======================
Lightweight PI-futexes
======================

We are calling them lightweight for 3 reasons:

- in the user-space fastpath a PI-enabled futex involves no kernel work
  (or any other PI complexity) at all. No registration, no extra kernel
  calls - just pure fast atomic ops in userspace.

- even in the slowpath, the system call and scheduling pattern is very
  similar to normal futexes.

- the in-kernel PI implementation is streamlined around the mutex
  abstraction, with strict rules that keep the implementation
  relatively simple: only a single owner may own a lock (i.e. no
  read-write lock support), only the owner may unlock a lock, no
  recursive locking, etc.

Priority Inheritance - why?
---------------------------

The short reply: user-space PI helps achieve/improve determinism for
user-space applications. In the best-case, it can help achieve
determinism and well-bound latencies. Even in the worst-case, PI will
improve the statistical distribution of locking related application
delays.

The longer reply
----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms. As we
can see in the kernel [which is a quite complex program in itself],
lockless structures are rather the exception than the norm - the current
ratio of lockless vs. locky code for shared data structures is somewhere
between 1:10 and 1:100. Lockless is hard, and the complexity of lockless
algorithms often endangers the ability to do robust reviews of said code.
I.e. critical RT apps often choose lock structures to protect critical
data structures, instead of lockless algorithms. Furthermore, there are
cases (like shared hardware, or other resource limits) where lockless
access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application
design with multiple tasks (with multiple priority levels) sharing
short-held locks: for example, a highprio audio playback thread is
combined with medium-prio construct-audio-data threads and low-prio
display-colory-stuff threads. Add video and decoding to the mix and
we've got even more priority levels.

So once we accept that synchronization objects (locks) are an
unavoidable fact of life, and once we accept that multi-task userspace
apps have a very fair expectation of being able to use locks, we've got
to think about how to offer the option of a deterministic locking
implementation to user-space.

Most of the technical counter-arguments against doing priority
inheritance only apply to kernel-space locks. But user-space locks are
different: there we cannot disable interrupts or make the task
non-preemptible in a critical section, so the 'use spinlocks' argument
does not apply (user-space spinlocks have the same priority inversion
problems as other user-space locking constructs). Fact is, pretty much
the only technique that currently enables good determinism for userspace
locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock
[this is a quite common scenario for most non-trivial RT applications],
even if all critical sections are coded carefully to be deterministic
(i.e. all critical sections are short in duration and only execute a
limited number of instructions), the kernel cannot guarantee any
deterministic execution of the high-prio task: any medium-priority task
could preempt the low-prio task while it holds the shared lock and
executes the critical section, and could delay it indefinitely.

Implementation
--------------

As mentioned before, the userspace fastpath of PI-enabled pthread
mutexes involves no kernel work at all - they behave quite similarly to
normal futex-based locks: a 0 value means unlocked, and a value==TID
means locked. (This is the same method as used by list-based robust
futexes.) Userspace uses atomic ops to lock/unlock these mutexes without
entering the kernel.

To handle the slowpath, we have added two new futex ops:

  - FUTEX_LOCK_PI
  - FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to
TID fails], then FUTEX_LOCK_PI is called. The kernel does all the
remaining work: if there is no futex-queue attached to the futex address
yet then the code looks up the task that owns the futex [it has put its
own TID into the futex value], and attaches a 'PI state' structure to
the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware,
kernel-based synchronization object. The 'other' task is made the owner
of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the
futex value. Then this task tries to lock the rt-mutex, on which it
blocks. Once it returns, it has the mutex acquired, and it sets the
futex value to its own TID and returns. Userspace has no other work to
perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e. userspace manages to do a
TID -> 0 atomic transition of the futex value], then no kernel work is
triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set),
then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the
behalf of userspace - and it also unlocks the attached
pi_state->rt_mutex and thus wakes up any potential waiters.

Note that under this approach, contrary to previous PI-futex approaches,
there is no prior 'registration' of a PI-futex. [which is not quite
possible anyway, due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal
properties of futexes, and all four combinations are possible: futex,
robust-futex, PI-futex, robust+PI-futex.

More details about priority inheritance can be found in
Documentation/locking/rt-mutex.rst.
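
A minimal sketch of the fastpath/slowpath split described above (editor's
illustration, not part of the patch; the helper names are invented, and
error handling, timeouts and the robust-list interplay are omitted)::

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>

	static void pi_lock(uint32_t *futex, uint32_t tid)
	{
		uint32_t expected = 0;

		/* fastpath: 0 -> TID transition, no kernel involvement */
		if (__atomic_compare_exchange_n(futex, &expected, tid, 0,
						__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return;

		/* slowpath: the kernel attaches the pi_state, boosts the
		 * owner, and returns with the futex owned by this task */
		syscall(SYS_futex, futex, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
	}

	static void pi_unlock(uint32_t *futex, uint32_t tid)
	{
		uint32_t expected = tid;

		/* fastpath: TID -> 0 transition, only works with no waiters */
		if (__atomic_compare_exchange_n(futex, &expected, 0, 0,
						__ATOMIC_RELEASE, __ATOMIC_RELAXED))
			return;

		/* slowpath: FUTEX_WAITERS is set, let the kernel hand the
		 * lock over and wake the next waiter */
		syscall(SYS_futex, futex, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
	}
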

Documentation/locking/preempt-locking.rst | 144 (new file)

@@ -0,0 +1,144 @@
===========================================================================
Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe
===========================================================================

:Author: Robert Love <rml@tech9.net>


Introduction
============


A preemptible kernel creates new locking issues. The issues are the same as
those under SMP: concurrency and reentrancy. Thankfully, the Linux preemptible
kernel model leverages existing SMP locking mechanisms. Thus, the kernel
requires explicit additional locking for very few additional situations.

This document is for all kernel hackers. Developing code in the kernel
requires protecting these situations.


RULE #1: Per-CPU data structures need explicit protection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Two similar problems arise. An example code snippet::

	struct this_needs_locking tux[NR_CPUS];
	tux[smp_processor_id()] = some_value;
	/* task is preempted here... */
	something = tux[smp_processor_id()];

First, since the data is per-CPU, it may not have explicit SMP locking, but
require it otherwise. Second, when a preempted task is finally rescheduled,
the previous value of smp_processor_id may not equal the current. You must
protect these situations by disabling preemption around them.

You can also use put_cpu() and get_cpu(), which will disable preemption.

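For instance, the snippet above can be made preempt-safe with the
get_cpu()/put_cpu() pair (an editor's sketch; tux and some_value are the
illustrative names from the snippet above)::

	int cpu;

	/* get_cpu() disables preemption and returns the current CPU id */
	cpu = get_cpu();
	tux[cpu] = some_value;
	something = tux[cpu];
	/* put_cpu() re-enables preemption */
	put_cpu();
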

RULE #2: CPU state must be protected.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Under preemption, the state of the CPU must be protected. This is arch-
dependent, but includes CPU structures and state not preserved over a context
switch. For example, on x86, entering and exiting FPU mode is now a critical
section that must occur while preemption is disabled. Think what would happen
if the kernel is executing a floating-point instruction and is then preempted.
Remember, the kernel does not save FPU state except for user tasks. Therefore,
upon preemption, the FPU registers will be sold to the lowest bidder. Thus,
preemption must be disabled around such regions.

Note, some FPU functions are already explicitly preempt safe. For example,
kernel_fpu_begin and kernel_fpu_end will disable and enable preemption.


RULE #3: Lock acquire and release must be performed by same task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


A lock acquired in one task must be released by the same task. This
means you can't do oddball things like acquire a lock and go off to
play while another task releases it. If you want to do something
like this, acquire and release the lock in the same code path and
have the caller wait on an event signaled by the other task.


Solution
========


Data protection under preemption is achieved by disabling preemption for the
duration of the critical region.

::

  preempt_enable()		decrement the preempt counter
  preempt_disable()		increment the preempt counter
  preempt_enable_no_resched()	decrement, but do not immediately preempt
  preempt_check_resched()	if needed, reschedule
  preempt_count()		return the preempt counter

The functions are nestable. In other words, you can call preempt_disable
n-times in a code path, and preemption will not be reenabled until the n-th
call to preempt_enable. The preempt statements are defined to nothing if
preemption is not enabled.

Note that you do not need to explicitly prevent preemption if you are holding
any locks or interrupts are disabled, since preemption is implicitly disabled
in those cases.

But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
disabling preemption - any cond_resched() or cond_resched_lock() might trigger
a reschedule if the preempt count is 0. A simple printk() might trigger a
reschedule. So use this implicit preemption-disabling property only if you
know that the affected codepath does not do any of this. Best policy is to use
this only for small, atomic code that you wrote and which calls no complex
functions.

Example::

	cpucache_t *cc; /* this is per-CPU */
	preempt_disable();
	cc = cc_data(searchp);
	if (cc && cc->avail) {
		__free_block(searchp, cc_entry(cc), cc->avail);
		cc->avail = 0;
	}
	preempt_enable();
	return 0;

Notice how the preemption statements must encompass every reference of the
critical variables. Another example::

	int buf[NR_CPUS];
	set_cpu_val(buf);
	if (buf[smp_processor_id()] == -1) printk(KERN_INFO "wee!\n");
	spin_lock(&buf_lock);
	/* ... */

This code is not preempt-safe, but see how easily we can fix it by simply
moving the spin_lock up two lines.


Preventing preemption using interrupt disabling
===============================================


It is possible to prevent a preemption event using local_irq_disable and
local_irq_save. Note, when doing so, you must be very careful to not cause
an event that would set need_resched and result in a preemption check. When
in doubt, rely on locking or explicit preemption disabling.

Note in 2.5 interrupt disabling is now only per-CPU (e.g. local).

An additional concern is proper usage of local_irq_disable and local_irq_save.
These may be used to protect from preemption, however, on exit, if preemption
may be enabled, a test to see if preemption is required should be done. If
these are called from the spin_lock and read/write lock macros, the right thing
is done. They may also be called within a spin-lock protected region, however,
if they are ever called outside of this context, a test for preemption should
be made. Do note that calls from interrupt context or bottom half/tasklets
are also protected by preemption locks and so may use the versions which do
not check preemption.

Documentation/locking/robust-futex-ABI.rst | 184 (new file)

@@ -0,0 +1,184 @@
====================
The robust futex ABI
====================

:Author: Started by Paul Jackson <pj@sgi.com>


Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention. The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

1) a one time call, per thread, to tell the kernel where its list of
   held robust_futexes begins, and
2) internal kernel code at exit, to handle any listed locks held
   by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel. Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them. If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call::

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words. Each word is 32 bits on 32 bit archs, or 64
bits on 64 bit archs, and local byte order. Each thread should have
its own thread private 'head'.

If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one using
32 bit words for 32 bit compatibility mode, and one using 64 bit words
for 64 bit native mode. The kernel, if it is a 64 bit kernel supporting
32 bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to setup that list.

The first word in the memory structure at 'head' contains a
pointer to a single linked list of 'lock entries', one per lock,
as described below. If the list is empty, the pointer will point
to itself, 'head'. The last 'lock entry' points back to the 'head'.

The second word, called 'offset', specifies the offset from the
address of the associated 'lock entry', plus or minus, of what will
be called the 'lock word', from that 'lock entry'. The 'lock word'
is always a 32 bit word, unlike the other words above. The 'lock
word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
of the thread holding the lock in the bottom 30 bits. See further
below for a description of the flag bits.

The third word, called 'list_op_pending', contains a transient copy of
the address of the 'lock entry', during list insertion and removal,
and is needed to correctly resolve races should a thread exit while
in the middle of a locking or unlocking operation.

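For reference, the three words described above map onto the C declaration
used by the kernel's uapi headers (sketched here; see
include/uapi/linux/futex.h for the authoritative definition)::

	/* one 'lock entry': a pointer to the next entry, or back to 'head' */
	struct robust_list {
		struct robust_list __user *next;
	};

	/* the per-thread structure registered with sys_set_robust_list() */
	struct robust_list_head {
		struct robust_list list;	/* first word: the list head */
		long futex_offset;		/* second word: 'offset' */
		struct robust_list __user *list_op_pending;	/* third word */
	};
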
Each 'lock entry' on the single linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries. In addition, nearby to each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes. The kernel will only be able to wakeup the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'. Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wakeup the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task. The task may retrieve that value later on by
using the system call::

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock. The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread. The thread should link those locks
it currently holds using the 'lock entry' pointers. It may also have
other links between the locks, such as the reverse side of a double
linked list, but that doesn't matter to the kernel.

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wakeup, locks. The kernel's
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time. Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region. The
thread currently holding such a lock, if any, is marked with the thread's
TID in the lower 30 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.

On removal:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock word' found by walking
the list starting at 'head'. For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equals the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will wake the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.

The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word
 3) if the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.

Documentation/locking/robust-futexes.rst | 221 (new file)

@@ -0,0 +1,221 @@
========================================
A description of what robust futexes are
========================================

:Started by: Ingo Molnar <mingo@redhat.com>

Background
----------

What are robust futexes? To answer that, we first need to understand
what futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having
to enter the kernel.

A futex is in essence a user-space address, e.g. a 32-bit lock variable
field. If userspace notices contention (the lock is already owned and
someone else wants to grab it too) then the lock is marked with a value
that says "there's a waiter pending", and the sys_futex(FUTEX_WAIT)
syscall is used to wait for the other guy to release it. The kernel
creates a 'futex queue' internally, so that it can later on match up the
waiter with the waker - without them having to know about each other.
When the owner thread releases the futex, it notices (via the variable
value) that there were waiter(s) pending, and does the
sys_futex(FUTEX_WAKE) syscall to wake them up. Once all waiters have
taken and released the lock, the futex is again back to 'uncontended'
state, and there's no in-kernel state associated with it. The kernel
completely forgets that there ever was a futex at that address. This
method makes futexes very lightweight and scalable.

"Robustness" is about dealing with crashes while holding a lock: if a
process exits prematurely while holding a pthread_mutex_t lock that is
also shared with some other process (e.g. yum segfaults while holding a
pthread_mutex_t, or yum is kill -9-ed), then waiters for that lock need
to be notified that the last owner of the lock exited in some irregular
way.

To solve such types of problems, "robust mutex" userspace APIs were
created: pthread_mutex_lock() returns an error value if the owner exits
prematurely - and the new owner can decide whether the data protected by
the lock can be recovered safely.

There is a big conceptual problem with futex based mutexes though: it is
the kernel that destroys the owner task (e.g. due to a SEGFAULT), but
the kernel cannot help with the cleanup: if there is no 'futex queue'
(and in most cases there is none, futexes being fast lightweight locks)
then the kernel has no information to clean up after the held lock!
Userspace has no chance to clean up after the lock either - userspace is
the one that crashes, so it has no opportunity to clean up. Catch-22.

In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot
is needed to release that futex based lock. This is one of the leading
bugreports against yum.

To solve this problem, the traditional approach was to extend the vma
(virtual memory area descriptor) concept to have a notion of 'pending
robust futexes attached to this area'. This approach requires 3 new
syscall variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and
FUTEX_RECOVER. At do_exit() time, all vmas are searched to see whether
they have a robust_head set. This approach has two fundamental problems
left:

- it has quite complex locking and race scenarios. The vma-based
  approach had been pending for years, but it is still not completely
  reliable.

- it has to scan _every_ vma at sys_exit() time, per thread!

The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas
every pthread_exit() takes a millisecond or more, also totally
destroying the CPU's L1 and L2 caches!

This is very much noticeable even for normal process sys_exit_group()
calls: the kernel has to do the vma scanning unconditionally! (this is
because the kernel has no knowledge about how many robust futexes there
are to be cleaned up, because a robust futex might have been registered
in another task, and the futex variable might have been simply mmap()-ed
into this process's address space).

This huge overhead forced the creation of CONFIG_FUTEX_ROBUST so that
normal kernels can turn it off, but worse than that: the overhead makes
robust futexes impractical for any type of generic Linux distribution.

So something had to be done.

New approach to robust futexes
------------------------------

At the heart of this new approach there is a per-thread private list of
robust locks that userspace is holding (maintained by glibc) - which
userspace list is registered with the kernel via a new syscall [this
registration happens at most once per thread lifetime]. At do_exit()
time, the kernel checks this user-space list: are there any robust futex
locks to be cleaned up?

In the common case, at do_exit() time, there is no list registered, so
the cost of robust futexes is just a simple current->robust_list != NULL
comparison. If the thread has registered a list, then normally the list
is empty. If the thread/process crashed or terminated in some incorrect
way then the list might be non-empty: in this case the kernel carefully
walks the list [not trusting it], and marks all locks that are owned by
this thread with the FUTEX_OWNER_DIED bit, and wakes up one waiter (if
any).

The list is guaranteed to be private and per-thread at do_exit() time,
so it can be accessed by the kernel in a lockless way.

There is one race possible though: since adding to and removing from the
list is done after the futex is acquired by glibc, there is a window of
a few instructions in which the thread (or process) could die there,
leaving the futex hung. To protect against this possibility, userspace
(glibc) also maintains a simple per-thread 'list_op_pending' field, to
allow the kernel to clean up if the thread dies after acquiring the lock,
but just before it could have added itself to the list. Glibc sets this
list_op_pending field before it tries to acquire the futex, and clears
it after the list-add (or list-remove) has finished.

That's all that is needed - all the rest of robust-futex cleanup is done
in userspace [just like with the previous patches].

Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes.

Key differences of this userspace-list based approach, compared to the
vma based method:

- it's much, much faster: at thread exit time, there's no need to loop
  over every vma (!), which the VM-based method has to do. Only a very
  simple 'is the list empty' op is done.

- no VM changes are needed - 'struct address_space' is left alone.

- no registration of individual locks is needed: robust mutexes don't
  need any extra per-lock syscalls. Robust mutexes thus become a very
  lightweight primitive - so they don't force the application designer
  to do a hard choice between performance and robustness - robust
  mutexes are just as fast.

- no per-lock kernel allocation happens.

- no resource limits are needed.

- no kernel-space recovery call (FUTEX_RECOVER) is needed.

- the implementation and the locking is "obvious", and there are no
  interactions with the VM.

Performance
-----------

I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:

- with FUTEX_WAIT set [contended mutex]: 130 msecs
- without FUTEX_WAIT set [uncontended mutex]: 30 msecs

I have also measured an approach where glibc does the lock notification
[which it currently does for !pshared robust mutexes], and that took 256
msecs - clearly slower, due to the 1 million FUTEX_WAKE syscalls
userspace had to do.

(1 million held locks are unheard of - we expect at most a handful of
locks to be held at a time. Nevertheless it's nice to know that this
approach scales nicely.)

Implementation details
----------------------

The patch adds two new syscalls: one to register the userspace list, and
one to query the registered list pointer::

	asmlinkage long
	sys_set_robust_list(struct robust_list_head __user *head,
			    size_t len);

	asmlinkage long
	sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
			    size_t __user *len_ptr);

List registration is very fast: the pointer is simply stored in
current->robust_list. [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head
for new threads, without the need of another syscall.]

So there is virtually zero overhead for tasks not using robust futexes,
and even for robust futex users, there is only one extra syscall per
thread lifetime, and the cleanup operation, if it happens, is fast and
straightforward. The kernel doesn't have any internal distinction between
robust and normal futexes.

If a futex is found to be held at exit time, the kernel sets the
following bit of the futex word::

	#define FUTEX_OWNER_DIED	0x40000000

and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.

Otherwise, robust futexes are acquired by glibc by putting the TID into
the futex field atomically. Waiters set the FUTEX_WAITERS bit::

	#define FUTEX_WAITERS		0x80000000

and the remaining bits are for the TID.

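A short user-space sketch of the registration step (editor's illustration;
in practice glibc performs this automatically for every thread, so
applications normally never call it themselves)::

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stddef.h>

	static struct robust_list_head robust_head = {
		.list = { &robust_head.list },	/* empty list points to itself */
		.futex_offset = 0,	/* a real locking library sets this to the
					 * offset from a 'lock entry' to its lock word */
		.list_op_pending = NULL,
	};

	static int register_robust_list(void)
	{
		/* one call per thread lifetime */
		return syscall(SYS_set_robust_list, &robust_head,
			       sizeof(robust_head));
	}
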
Testing, architecture support
-----------------------------

I've tested the new syscalls on x86 and x86_64, and have made sure the
parsing of the userspace list is robust [ ;-) ] even if the list is
deliberately corrupted.

i386 and x86_64 syscalls are wired up at the moment, and Ulrich has
tested the new glibc code (on x86_64 and i386), and it works for his
robust-mutex testcases.

All other architectures should build just fine too - but they won't have
the new syscalls yet.

Architectures need to implement the new futex_atomic_cmpxchg_inatomic()
inline function before wiring up the syscalls.

Documentation/locking/rt-mutex.rst

@@ -4,7 +4,7 @@ RT-mutex subsystem with PI support
 
 RT-mutexes with priority inheritance are used to support PI-futexes,
 which enable pthread_mutex_t priority inheritance attributes
-(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details
+(PTHREAD_PRIO_INHERIT). [See Documentation/locking/pi-futex.rst for more details
 about PI-futexes.]
 
 This technology was developed in the -rt tree and streamlined for