Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v5.9 RCU bits from Paul E. McKenney: - Documentation updates - Miscellaneous fixes - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit is contained in:
@@ -2583,7 +2583,12 @@ not work to have these markers in the trampoline itself, because there
|
||||
would need to be instructions following ``rcu_read_unlock()``. Although
|
||||
``synchronize_rcu()`` would guarantee that execution reached the
|
||||
``rcu_read_unlock()``, it would not be able to guarantee that execution
|
||||
had completely left the trampoline.
|
||||
had completely left the trampoline. Worse yet, in some situations
|
||||
the trampoline's protection must extend a few instructions *prior* to
|
||||
execution reaching the trampoline. For example, these few instructions
|
||||
might calculate the address of the trampoline, so that entering the
|
||||
trampoline would be pre-ordained a surprisingly long time before execution
|
||||
actually reached the trampoline itself.
|
||||
|
||||
The solution, in the form of `Tasks
|
||||
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
|
||||
|
@@ -1,4 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
================================
|
||||
Review Checklist for RCU Patches
|
||||
================================
|
||||
|
||||
|
||||
This document contains a checklist for producing and reviewing patches
|
||||
@@ -411,18 +415,21 @@ over a rather long period of time, but improvements are always welcome!
|
||||
__rcu sparse checks to validate your RCU code. These can help
|
||||
find problems as follows:
|
||||
|
||||
CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data
|
||||
CONFIG_PROVE_LOCKING:
|
||||
check that accesses to RCU-protected data
|
||||
structures are carried out under the proper RCU
|
||||
read-side critical section, while holding the right
|
||||
combination of locks, or whatever other conditions
|
||||
are appropriate.
|
||||
|
||||
CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the
|
||||
CONFIG_DEBUG_OBJECTS_RCU_HEAD:
|
||||
check that you don't pass the
|
||||
same object to call_rcu() (or friends) before an RCU
|
||||
grace period has elapsed since the last time that you
|
||||
passed that same object to call_rcu() (or friends).
|
||||
|
||||
__rcu sparse checks: tag the pointer to the RCU-protected data
|
||||
__rcu sparse checks:
|
||||
tag the pointer to the RCU-protected data
|
||||
structure with __rcu, and sparse will warn you if you
|
||||
access that pointer without the services of one of the
|
||||
variants of rcu_dereference().
|
||||
@@ -442,8 +449,8 @@ over a rather long period of time, but improvements are always welcome!
|
||||
|
||||
You instead need to use one of the barrier functions:
|
||||
|
||||
o call_rcu() -> rcu_barrier()
|
||||
o call_srcu() -> srcu_barrier()
|
||||
- call_rcu() -> rcu_barrier()
|
||||
- call_srcu() -> srcu_barrier()
|
||||
|
||||
However, these barrier functions are absolutely -not- guaranteed
|
||||
to wait for a grace period. In fact, if there are no call_rcu()
|
@@ -1,3 +1,5 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
.. _rcu_concepts:
|
||||
|
||||
============
|
||||
@@ -8,10 +10,17 @@ RCU concepts
|
||||
:maxdepth: 3
|
||||
|
||||
arrayRCU
|
||||
checklist
|
||||
lockdep
|
||||
lockdep-splat
|
||||
rcubarrier
|
||||
rcu_dereference
|
||||
whatisRCU
|
||||
rcu
|
||||
rculist_nulls
|
||||
rcuref
|
||||
torture
|
||||
stallwarn
|
||||
listRCU
|
||||
NMI-RCU
|
||||
UP
|
||||
|
@@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
Lockdep-RCU Splat
|
||||
=================
|
||||
|
||||
Lockdep-RCU was added to the Linux kernel in early 2010
|
||||
(http://lwn.net/Articles/371986/). This facility checks for some common
|
||||
misuses of the RCU API, most notably using one of the rcu_dereference()
|
||||
@@ -12,55 +18,54 @@ overwriting or worse. There can of course be false positives, this
|
||||
being the real world and all that.
|
||||
|
||||
So let's look at an example RCU lockdep splat from 3.0-rc5, one that
|
||||
has long since been fixed:
|
||||
has long since been fixed::
|
||||
|
||||
=============================
|
||||
WARNING: suspicious RCU usage
|
||||
-----------------------------
|
||||
block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage!
|
||||
=============================
|
||||
WARNING: suspicious RCU usage
|
||||
-----------------------------
|
||||
block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage!
|
||||
|
||||
other info that might help us debug this:
|
||||
other info that might help us debug this::
|
||||
|
||||
rcu_scheduler_active = 1, debug_locks = 0
|
||||
3 locks held by scsi_scan_6/1552:
|
||||
#0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>]
|
||||
scsi_scan_host_selected+0x5a/0x150
|
||||
#1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>]
|
||||
elevator_exit+0x22/0x60
|
||||
#2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>]
|
||||
cfq_exit_queue+0x43/0x190
|
||||
|
||||
rcu_scheduler_active = 1, debug_locks = 0
|
||||
3 locks held by scsi_scan_6/1552:
|
||||
#0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>]
|
||||
scsi_scan_host_selected+0x5a/0x150
|
||||
#1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>]
|
||||
elevator_exit+0x22/0x60
|
||||
#2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>]
|
||||
cfq_exit_queue+0x43/0x190
|
||||
stack backtrace:
|
||||
Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17
|
||||
Call Trace:
|
||||
[<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0
|
||||
[<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120
|
||||
[<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190
|
||||
[<ffffffff812a5046>] elevator_exit+0x36/0x60
|
||||
[<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60
|
||||
[<ffffffff8145cc09>] scsi_free_queue+0x9/0x10
|
||||
[<ffffffff81460944>] __scsi_remove_device+0x84/0xd0
|
||||
[<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10
|
||||
[<ffffffff817da069>] ? error_exit+0x29/0xb0
|
||||
[<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
|
||||
[<ffffffff8145e722>] __scsi_scan_target+0x112/0x680
|
||||
[<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
|
||||
[<ffffffff817da069>] ? error_exit+0x29/0xb0
|
||||
[<ffffffff812bcc60>] ? kobject_del+0x40/0x40
|
||||
[<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0
|
||||
[<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150
|
||||
[<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90
|
||||
[<ffffffff8145f170>] do_scan_async+0x20/0x160
|
||||
[<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90
|
||||
[<ffffffff810975b6>] kthread+0xa6/0xb0
|
||||
[<ffffffff817db154>] kernel_thread_helper+0x4/0x10
|
||||
[<ffffffff81066430>] ? finish_task_switch+0x80/0x110
|
||||
[<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe
|
||||
[<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70
|
||||
[<ffffffff817db150>] ? gs_change+0xb/0xb
|
||||
|
||||
stack backtrace:
|
||||
Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17
|
||||
Call Trace:
|
||||
[<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0
|
||||
[<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120
|
||||
[<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190
|
||||
[<ffffffff812a5046>] elevator_exit+0x36/0x60
|
||||
[<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60
|
||||
[<ffffffff8145cc09>] scsi_free_queue+0x9/0x10
|
||||
[<ffffffff81460944>] __scsi_remove_device+0x84/0xd0
|
||||
[<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10
|
||||
[<ffffffff817da069>] ? error_exit+0x29/0xb0
|
||||
[<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
|
||||
[<ffffffff8145e722>] __scsi_scan_target+0x112/0x680
|
||||
[<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
|
||||
[<ffffffff817da069>] ? error_exit+0x29/0xb0
|
||||
[<ffffffff812bcc60>] ? kobject_del+0x40/0x40
|
||||
[<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0
|
||||
[<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150
|
||||
[<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90
|
||||
[<ffffffff8145f170>] do_scan_async+0x20/0x160
|
||||
[<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90
|
||||
[<ffffffff810975b6>] kthread+0xa6/0xb0
|
||||
[<ffffffff817db154>] kernel_thread_helper+0x4/0x10
|
||||
[<ffffffff81066430>] ? finish_task_switch+0x80/0x110
|
||||
[<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe
|
||||
[<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70
|
||||
[<ffffffff817db150>] ? gs_change+0xb/0xb
|
||||
|
||||
Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows:
|
||||
Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows::
|
||||
|
||||
if (rcu_dereference(ioc->ioc_data) == cic) {
|
||||
|
||||
@@ -70,7 +75,7 @@ case. Instead, we hold three locks, one of which might be RCU related.
|
||||
And maybe that lock really does protect this reference. If so, the fix
|
||||
is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to
|
||||
take the struct request_queue "q" from cfq_exit_queue() as an argument,
|
||||
which would permit us to invoke rcu_dereference_protected as follows:
|
||||
which would permit us to invoke rcu_dereference_protected as follows::
|
||||
|
||||
if (rcu_dereference_protected(ioc->ioc_data,
|
||||
lockdep_is_held(&q->queue_lock)) == cic) {
|
||||
@@ -85,7 +90,7 @@ On the other hand, perhaps we really do need an RCU read-side critical
|
||||
section. In this case, the critical section must span the use of the
|
||||
return value from rcu_dereference(), or at least until there is some
|
||||
reference count incremented or some such. One way to handle this is to
|
||||
add rcu_read_lock() and rcu_read_unlock() as follows:
|
||||
add rcu_read_lock() and rcu_read_unlock() as follows::
|
||||
|
||||
rcu_read_lock();
|
||||
if (rcu_dereference(ioc->ioc_data) == cic) {
|
||||
@@ -102,7 +107,7 @@ above lockdep-RCU splat.
|
||||
But in this particular case, we don't actually dereference the pointer
|
||||
returned from rcu_dereference(). Instead, that pointer is just compared
|
||||
to the cic pointer, which means that the rcu_dereference() can be replaced
|
||||
by rcu_access_pointer() as follows:
|
||||
by rcu_access_pointer() as follows::
|
||||
|
||||
if (rcu_access_pointer(ioc->ioc_data) == cic) {
|
||||
|
@@ -1,4 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
========================
|
||||
RCU and lockdep checking
|
||||
========================
|
||||
|
||||
All flavors of RCU have lockdep checking available, so that lockdep is
|
||||
aware of when each task enters and leaves any flavor of RCU read-side
|
||||
@@ -8,7 +12,7 @@ tracking to include RCU state, which can sometimes help when debugging
|
||||
deadlocks and the like.
|
||||
|
||||
In addition, RCU provides the following primitives that check lockdep's
|
||||
state:
|
||||
state::
|
||||
|
||||
rcu_read_lock_held() for normal RCU.
|
||||
rcu_read_lock_bh_held() for RCU-bh.
|
||||
@@ -63,7 +67,7 @@ checking of rcu_dereference() primitives:
|
||||
The rcu_dereference_check() check expression can be any boolean
|
||||
expression, but would normally include a lockdep expression. However,
|
||||
any boolean expression can be used. For a moderately ornate example,
|
||||
consider the following:
|
||||
consider the following::
|
||||
|
||||
file = rcu_dereference_check(fdt->fd[fd],
|
||||
lockdep_is_held(&files->file_lock) ||
|
||||
@@ -82,7 +86,7 @@ RCU read-side critical sections, in case (2) the ->file_lock prevents
|
||||
any change from taking place, and finally, in case (3) the current task
|
||||
is the only task accessing the file_struct, again preventing any change
|
||||
from taking place. If the above statement was invoked only from updater
|
||||
code, it could instead be written as follows:
|
||||
code, it could instead be written as follows::
|
||||
|
||||
file = rcu_dereference_protected(fdt->fd[fd],
|
||||
lockdep_is_held(&files->file_lock) ||
|
||||
@@ -105,7 +109,7 @@ false and they are called from outside any RCU read-side critical section.
|
||||
|
||||
For example, the workqueue for_each_pwq() macro is intended to be used
|
||||
either within an RCU read-side critical section or with wq->mutex held.
|
||||
It is thus implemented as follows:
|
||||
It is thus implemented as follows::
|
||||
|
||||
#define for_each_pwq(pwq, wq)
|
||||
list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,
|
200
Documentation/RCU/rculist_nulls.rst
Normal file
200
Documentation/RCU/rculist_nulls.rst
Normal file
@@ -0,0 +1,200 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================================================
|
||||
Using RCU hlist_nulls to protect list and objects
|
||||
=================================================
|
||||
|
||||
This section describes how to use hlist_nulls to
|
||||
protect read-mostly linked lists and
|
||||
objects using SLAB_TYPESAFE_BY_RCU allocations.
|
||||
|
||||
Please read the basics in Documentation/RCU/listRCU.rst
|
||||
|
||||
Using 'nulls'
|
||||
=============
|
||||
|
||||
Using special makers (called 'nulls') is a convenient way
|
||||
to solve following problem :
|
||||
|
||||
A typical RCU linked list managing objects which are
|
||||
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
|
||||
use following algos :
|
||||
|
||||
1) Lookup algo
|
||||
--------------
|
||||
|
||||
::
|
||||
|
||||
rcu_read_lock()
|
||||
begin:
|
||||
obj = lockless_lookup(key);
|
||||
if (obj) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
goto begin;
|
||||
/*
|
||||
* Because a writer could delete object, and a writer could
|
||||
* reuse these object before the RCU grace period, we
|
||||
* must check key after getting the reference on object
|
||||
*/
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
goto begin;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
|
||||
Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
|
||||
but a version with an additional memory barrier (smp_rmb())
|
||||
|
||||
::
|
||||
|
||||
lockless_lookup(key)
|
||||
{
|
||||
struct hlist_node *node, *next;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
|
||||
|
||||
struct hlist_node *node;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ prefetch(pos->next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(pos->next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
|
||||
Quoting Corey Minyard::
|
||||
|
||||
"If the object is moved from one list to another list in-between the
|
||||
time the hash is calculated and the next field is accessed, and the
|
||||
object has moved to the end of a new list, the traversal will not
|
||||
complete properly on the list it should have, since the object will
|
||||
be on the end of the new list and there's not a way to tell it's on a
|
||||
new list and restart the list traversal. I think that this can be
|
||||
solved by pre-fetching the "next" field (with proper barriers) before
|
||||
checking the key."
|
||||
|
||||
2) Insert algo
|
||||
--------------
|
||||
|
||||
We need to make sure a reader cannot read the new 'obj->obj_next' value
|
||||
and previous value of 'obj->key'. Or else, an item could be deleted
|
||||
from a chain, and inserted into another chain. If new chain was empty
|
||||
before the move, 'next' pointer is NULL, and lockless reader can
|
||||
not detect it missed following items in original chain.
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(...);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
/*
|
||||
* we need to make sure obj->key is updated before obj->next
|
||||
* or obj->refcnt
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
hlist_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
|
||||
|
||||
3) Remove algo
|
||||
--------------
|
||||
Nothing special here, we can use a standard RCU hlist deletion.
|
||||
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
|
||||
very very fast (before the end of RCU grace period)
|
||||
|
||||
::
|
||||
|
||||
if (put_last_reference_on(obj) {
|
||||
lock_chain(); // typically a spin_lock()
|
||||
hlist_del_init_rcu(&obj->obj_node);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
kmem_cache_free(cachep, obj);
|
||||
}
|
||||
|
||||
|
||||
|
||||
--------------------------------------------------------------------------
|
||||
|
||||
Avoiding extra smp_rmb()
|
||||
========================
|
||||
|
||||
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
|
||||
and extra smp_wmb() in insert function.
|
||||
|
||||
For example, if we choose to store the slot number as the 'nulls'
|
||||
end-of-list marker for each slot of the hash table, we can detect
|
||||
a race (some writer did a delete and/or a move of an object
|
||||
to another chain) checking the final 'nulls' value if
|
||||
the lookup met the end of chain. If final 'nulls' value
|
||||
is not the slot number, then we must restart the lookup at
|
||||
the beginning. If the object was moved to the same chain,
|
||||
then the reader doesn't care : It might eventually
|
||||
scan the list again without harm.
|
||||
|
||||
|
||||
1) lookup algo
|
||||
--------------
|
||||
|
||||
::
|
||||
|
||||
head = &table[slot];
|
||||
rcu_read_lock();
|
||||
begin:
|
||||
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
|
||||
if (obj->key == key) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
goto begin;
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
goto begin;
|
||||
}
|
||||
goto out;
|
||||
}
|
||||
/*
|
||||
* if the nulls value we got at the end of this lookup is
|
||||
* not the expected one, we must restart lookup.
|
||||
* We probably met an item that was moved to another chain.
|
||||
*/
|
||||
if (get_nulls_value(node) != slot)
|
||||
goto begin;
|
||||
obj = NULL;
|
||||
|
||||
out:
|
||||
rcu_read_unlock();
|
||||
|
||||
2) Insert function
|
||||
------------------
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(cachep);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
/*
|
||||
* changes to obj->key must be visible before refcnt one
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
/*
|
||||
* insert obj in RCU way (readers might be traversing chain)
|
||||
*/
|
||||
hlist_nulls_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
@@ -1,172 +0,0 @@
|
||||
Using hlist_nulls to protect read-mostly linked lists and
|
||||
objects using SLAB_TYPESAFE_BY_RCU allocations.
|
||||
|
||||
Please read the basics in Documentation/RCU/listRCU.rst
|
||||
|
||||
Using special makers (called 'nulls') is a convenient way
|
||||
to solve following problem :
|
||||
|
||||
A typical RCU linked list managing objects which are
|
||||
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
|
||||
use following algos :
|
||||
|
||||
1) Lookup algo
|
||||
--------------
|
||||
rcu_read_lock()
|
||||
begin:
|
||||
obj = lockless_lookup(key);
|
||||
if (obj) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
goto begin;
|
||||
/*
|
||||
* Because a writer could delete object, and a writer could
|
||||
* reuse these object before the RCU grace period, we
|
||||
* must check key after getting the reference on object
|
||||
*/
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
goto begin;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
|
||||
Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
|
||||
but a version with an additional memory barrier (smp_rmb())
|
||||
|
||||
lockless_lookup(key)
|
||||
{
|
||||
struct hlist_node *node, *next;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
|
||||
And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() :
|
||||
|
||||
struct hlist_node *node;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ prefetch(pos->next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(pos->next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
Quoting Corey Minyard :
|
||||
|
||||
"If the object is moved from one list to another list in-between the
|
||||
time the hash is calculated and the next field is accessed, and the
|
||||
object has moved to the end of a new list, the traversal will not
|
||||
complete properly on the list it should have, since the object will
|
||||
be on the end of the new list and there's not a way to tell it's on a
|
||||
new list and restart the list traversal. I think that this can be
|
||||
solved by pre-fetching the "next" field (with proper barriers) before
|
||||
checking the key."
|
||||
|
||||
2) Insert algo :
|
||||
----------------
|
||||
|
||||
We need to make sure a reader cannot read the new 'obj->obj_next' value
|
||||
and previous value of 'obj->key'. Or else, an item could be deleted
|
||||
from a chain, and inserted into another chain. If new chain was empty
|
||||
before the move, 'next' pointer is NULL, and lockless reader can
|
||||
not detect it missed following items in original chain.
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(...);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
/*
|
||||
* we need to make sure obj->key is updated before obj->next
|
||||
* or obj->refcnt
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
hlist_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
|
||||
|
||||
3) Remove algo
|
||||
--------------
|
||||
Nothing special here, we can use a standard RCU hlist deletion.
|
||||
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
|
||||
very very fast (before the end of RCU grace period)
|
||||
|
||||
if (put_last_reference_on(obj) {
|
||||
lock_chain(); // typically a spin_lock()
|
||||
hlist_del_init_rcu(&obj->obj_node);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
kmem_cache_free(cachep, obj);
|
||||
}
|
||||
|
||||
|
||||
|
||||
--------------------------------------------------------------------------
|
||||
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
|
||||
and extra smp_wmb() in insert function.
|
||||
|
||||
For example, if we choose to store the slot number as the 'nulls'
|
||||
end-of-list marker for each slot of the hash table, we can detect
|
||||
a race (some writer did a delete and/or a move of an object
|
||||
to another chain) checking the final 'nulls' value if
|
||||
the lookup met the end of chain. If final 'nulls' value
|
||||
is not the slot number, then we must restart the lookup at
|
||||
the beginning. If the object was moved to the same chain,
|
||||
then the reader doesn't care : It might eventually
|
||||
scan the list again without harm.
|
||||
|
||||
|
||||
1) lookup algo
|
||||
|
||||
head = &table[slot];
|
||||
rcu_read_lock();
|
||||
begin:
|
||||
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
|
||||
if (obj->key == key) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
goto begin;
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
goto begin;
|
||||
}
|
||||
goto out;
|
||||
}
|
||||
/*
|
||||
* if the nulls value we got at the end of this lookup is
|
||||
* not the expected one, we must restart lookup.
|
||||
* We probably met an item that was moved to another chain.
|
||||
*/
|
||||
if (get_nulls_value(node) != slot)
|
||||
goto begin;
|
||||
obj = NULL;
|
||||
|
||||
out:
|
||||
rcu_read_unlock();
|
||||
|
||||
2) Insert function :
|
||||
--------------------
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(cachep);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
/*
|
||||
* changes to obj->key must be visible before refcnt one
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
/*
|
||||
* insert obj in RCU way (readers might be traversing chain)
|
||||
*/
|
||||
hlist_nulls_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
@@ -1,4 +1,8 @@
|
||||
Reference-count design for elements of lists/arrays protected by RCU.
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====================================================================
|
||||
Reference-count design for elements of lists/arrays protected by RCU
|
||||
====================================================================
|
||||
|
||||
|
||||
Please note that the percpu-ref feature is likely your first
|
||||
@@ -12,32 +16,33 @@ please read on.
|
||||
Reference counting on elements of lists which are protected by traditional
|
||||
reader/writer spinlocks or semaphores are straightforward:
|
||||
|
||||
CODE LISTING A:
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
}
|
||||
CODE LISTING A::
|
||||
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
if(atomic_dec_and_test(&el->rc)) ...
|
||||
kfree(el);
|
||||
... remove_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
}
|
||||
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
if(atomic_dec_and_test(&el->rc)) ...
|
||||
kfree(el);
|
||||
... remove_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
|
||||
If this list/array is made lock free using RCU as in changing the
|
||||
write_lock() in add() and delete() to spin_lock() and changing read_lock()
|
||||
@@ -46,34 +51,35 @@ search_and_reference() could potentially hold reference to an element which
|
||||
has already been deleted from the list/array. Use atomic_inc_not_zero()
|
||||
in this scenario as follows:
|
||||
|
||||
CODE LISTING B:
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) {
|
||||
spin_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
spin_unlock(&list_lock); rcu_read_unlock();
|
||||
} }
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... spin_lock(&list_lock);
|
||||
if (atomic_dec_and_test(&el->rc)) ...
|
||||
call_rcu(&el->head, el_free); remove_element
|
||||
... spin_unlock(&list_lock);
|
||||
} ...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
}
|
||||
CODE LISTING B::
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) {
|
||||
spin_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
spin_unlock(&list_lock); rcu_read_unlock();
|
||||
} }
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... spin_lock(&list_lock);
|
||||
if (atomic_dec_and_test(&el->rc)) ...
|
||||
call_rcu(&el->head, el_free); remove_element
|
||||
... spin_unlock(&list_lock);
|
||||
} ...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
}
|
||||
|
||||
Sometimes, a reference to the element needs to be obtained in the
|
||||
update (write) stream. In such cases, atomic_inc_not_zero() might be
|
||||
update (write) stream. In such cases, atomic_inc_not_zero() might be
|
||||
overkill, since we hold the update-side spinlock. One might instead
|
||||
use atomic_inc() in such cases.
|
||||
|
||||
@@ -82,39 +88,40 @@ search_and_reference() code path. In such cases, the
|
||||
atomic_dec_and_test() may be moved from delete() to el_free()
|
||||
as follows:
|
||||
|
||||
CODE LISTING C:
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
spin_lock(&list_lock); ...
|
||||
CODE LISTING C::
|
||||
|
||||
add_element rcu_read_unlock();
|
||||
... }
|
||||
spin_unlock(&list_lock); 4.
|
||||
} delete()
|
||||
3. {
|
||||
release_referenced() spin_lock(&list_lock);
|
||||
{ ...
|
||||
... remove_element
|
||||
if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock);
|
||||
kfree(el); ...
|
||||
... call_rcu(&el->head, el_free);
|
||||
} ...
|
||||
5. }
|
||||
void el_free(struct rcu_head *rhp)
|
||||
{
|
||||
release_referenced();
|
||||
}
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
spin_lock(&list_lock); ...
|
||||
|
||||
add_element rcu_read_unlock();
|
||||
... }
|
||||
spin_unlock(&list_lock); 4.
|
||||
} delete()
|
||||
3. {
|
||||
release_referenced() spin_lock(&list_lock);
|
||||
{ ...
|
||||
... remove_element
|
||||
if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock);
|
||||
kfree(el); ...
|
||||
... call_rcu(&el->head, el_free);
|
||||
} ...
|
||||
5. }
|
||||
void el_free(struct rcu_head *rhp)
|
||||
{
|
||||
release_referenced();
|
||||
}
|
||||
|
||||
The key point is that the initial reference added by add() is not removed
|
||||
until after a grace period has elapsed following removal. This means that
|
||||
search_and_reference() cannot find this element, which means that the value
|
||||
of el->rc cannot increase. Thus, once it reaches zero, there are no
|
||||
readers that can or ever will be able to reference the element. The
|
||||
element can therefore safely be freed. This in turn guarantees that if
|
||||
readers that can or ever will be able to reference the element. The
|
||||
element can therefore safely be freed. This in turn guarantees that if
|
||||
any reader finds the element, that reader may safely acquire a reference
|
||||
without checking the value of the reference counter.
|
||||
|
||||
@@ -130,21 +137,21 @@ the eventual invocation of kfree(), which is usually not a problem on
|
||||
modern computer systems, even the small ones.
|
||||
|
||||
In cases where delete() can sleep, synchronize_rcu() can be called from
|
||||
delete(), so that el_free() can be subsumed into delete as follows:
|
||||
delete(), so that el_free() can be subsumed into delete as follows::
|
||||
|
||||
4.
|
||||
delete()
|
||||
{
|
||||
spin_lock(&list_lock);
|
||||
...
|
||||
remove_element
|
||||
spin_unlock(&list_lock);
|
||||
...
|
||||
synchronize_rcu();
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
4.
|
||||
delete()
|
||||
{
|
||||
spin_lock(&list_lock);
|
||||
...
|
||||
remove_element
|
||||
spin_unlock(&list_lock);
|
||||
...
|
||||
synchronize_rcu();
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
|
||||
As additional examples in the kernel, the pattern in listing C is used by
|
||||
reference counting of struct pid, while the pattern in listing B is used by
|
@@ -1,4 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============================
|
||||
Using RCU's CPU Stall Detector
|
||||
==============================
|
||||
|
||||
This document first discusses what sorts of issues RCU's CPU stall
|
||||
detector can locate, and then discusses kernel parameters and Kconfig
|
||||
@@ -7,39 +11,40 @@ this document explains the stall detector's "splat" format.
|
||||
|
||||
|
||||
What Causes RCU CPU Stall Warnings?
|
||||
===================================
|
||||
|
||||
So your kernel printed an RCU CPU stall warning. The next question is
|
||||
"What caused it?" The following problems can result in RCU CPU stall
|
||||
warnings:
|
||||
|
||||
o A CPU looping in an RCU read-side critical section.
|
||||
- A CPU looping in an RCU read-side critical section.
|
||||
|
||||
o A CPU looping with interrupts disabled.
|
||||
- A CPU looping with interrupts disabled.
|
||||
|
||||
o A CPU looping with preemption disabled.
|
||||
- A CPU looping with preemption disabled.
|
||||
|
||||
o A CPU looping with bottom halves disabled.
|
||||
- A CPU looping with bottom halves disabled.
|
||||
|
||||
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
|
||||
- For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
|
||||
without invoking schedule(). If the looping in the kernel is
|
||||
really expected and desirable behavior, you might need to add
|
||||
some calls to cond_resched().
|
||||
|
||||
o Booting Linux using a console connection that is too slow to
|
||||
- Booting Linux using a console connection that is too slow to
|
||||
keep up with the boot-time console-message rate. For example,
|
||||
a 115Kbaud serial console can be -way- too slow to keep up
|
||||
with boot-time message rates, and will frequently result in
|
||||
RCU CPU stall warning messages. Especially if you have added
|
||||
debug printk()s.
|
||||
|
||||
o Anything that prevents RCU's grace-period kthreads from running.
|
||||
- Anything that prevents RCU's grace-period kthreads from running.
|
||||
This can result in the "All QSes seen" console-log message.
|
||||
This message will include information on when the kthread last
|
||||
ran and how often it should be expected to run. It can also
|
||||
result in the "rcu_.*kthread starved for" console-log message,
|
||||
result in the ``rcu_.*kthread starved for`` console-log message,
|
||||
which will include additional debugging information.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
|
||||
- A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
|
||||
happen to preempt a low-priority task in the middle of an RCU
|
||||
read-side critical section. This is especially damaging if
|
||||
that low-priority task is not permitted to run on any other CPU,
|
||||
@@ -48,7 +53,7 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
|
||||
While the system is in the process of running itself out of
|
||||
memory, you might see stall-warning messages.
|
||||
|
||||
o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
|
||||
- A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
|
||||
is running at a higher priority than the RCU softirq threads.
|
||||
This will prevent RCU callbacks from ever being invoked,
|
||||
and in a CONFIG_PREEMPT_RCU kernel will further prevent
|
||||
@@ -63,7 +68,7 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
|
||||
can increase your system's context-switch rate and thus degrade
|
||||
performance.
|
||||
|
||||
o A periodic interrupt whose handler takes longer than the time
|
||||
- A periodic interrupt whose handler takes longer than the time
|
||||
interval between successive pairs of interrupts. This can
|
||||
prevent RCU's kthreads and softirq handlers from running.
|
||||
Note that certain high-overhead debugging options, for example
|
||||
@@ -71,20 +76,27 @@ o A periodic interrupt whose handler takes longer than the time
|
||||
considerably longer than normal, which can in turn result in
|
||||
RCU CPU stall warnings.
|
||||
|
||||
o Testing a workload on a fast system, tuning the stall-warning
|
||||
- Testing a workload on a fast system, tuning the stall-warning
|
||||
timeout down to just barely avoid RCU CPU stall warnings, and then
|
||||
running the same workload with the same stall-warning timeout on a
|
||||
slow system. Note that thermal throttling and on-demand governors
|
||||
can cause a single system to be sometimes fast and sometimes slow!
|
||||
|
||||
o A hardware or software issue shuts off the scheduler-clock
|
||||
- A hardware or software issue shuts off the scheduler-clock
|
||||
interrupt on a CPU that is not in dyntick-idle mode. This
|
||||
problem really has happened, and seems to be most likely to
|
||||
result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
|
||||
|
||||
o A bug in the RCU implementation.
|
||||
- A hardware or software issue that prevents time-based wakeups
|
||||
from occurring. These issues can range from misconfigured or
|
||||
buggy timer hardware through bugs in the interrupt or exception
|
||||
path (whether hardware, firmware, or software) through bugs
|
||||
in Linux's timer subsystem through bugs in the scheduler, and,
|
||||
yes, even including bugs in RCU itself.
|
||||
|
||||
o A hardware failure. This is quite unlikely, but has occurred
|
||||
- A bug in the RCU implementation.
|
||||
|
||||
- A hardware failure. This is quite unlikely, but has occurred
|
||||
at least once in real life. A CPU failed in a running system,
|
||||
becoming unresponsive, but not causing an immediate crash.
|
||||
This resulted in a series of RCU CPU stall warnings, eventually
|
||||
@@ -109,6 +121,7 @@ see include/trace/events/rcu.h.
|
||||
|
||||
|
||||
Fine-Tuning the RCU CPU Stall Detector
|
||||
======================================
|
||||
|
||||
The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's
|
||||
CPU stall detector, which detects conditions that unduly delay RCU grace
|
||||
@@ -118,6 +131,7 @@ The stall detector's idea of what constitutes "unduly delayed" is
|
||||
controlled by a set of kernel configuration variables and cpp macros:
|
||||
|
||||
CONFIG_RCU_CPU_STALL_TIMEOUT
|
||||
----------------------------
|
||||
|
||||
This kernel configuration parameter defines the period of time
|
||||
that RCU will wait from the beginning of a grace period until it
|
||||
@@ -137,6 +151,7 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
|
||||
/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress.
|
||||
|
||||
RCU_STALL_DELAY_DELTA
|
||||
---------------------
|
||||
|
||||
Although the lockdep facility is extremely useful, it does add
|
||||
some overhead. Therefore, under CONFIG_PROVE_RCU, the
|
||||
@@ -145,6 +160,7 @@ RCU_STALL_DELAY_DELTA
|
||||
macro, not a kernel configuration parameter.)
|
||||
|
||||
RCU_STALL_RAT_DELAY
|
||||
-------------------
|
||||
|
||||
The CPU stall detector tries to make the offending CPU print its
|
||||
own warnings, as this often gives better-quality stack traces.
|
||||
@@ -155,6 +171,7 @@ RCU_STALL_RAT_DELAY
|
||||
parameter.)
|
||||
|
||||
rcupdate.rcu_task_stall_timeout
|
||||
-------------------------------
|
||||
|
||||
This boot/sysfs parameter controls the RCU-tasks stall warning
|
||||
interval. A value of zero or less suppresses RCU-tasks stall
|
||||
@@ -168,9 +185,10 @@ rcupdate.rcu_task_stall_timeout
|
||||
|
||||
|
||||
Interpreting RCU's CPU Stall-Detector "Splats"
|
||||
==============================================
|
||||
|
||||
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
|
||||
it will print a message similar to the following:
|
||||
it will print a message similar to the following::
|
||||
|
||||
INFO: rcu_sched detected stalls on CPUs/tasks:
|
||||
2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0
|
||||
@@ -223,7 +241,7 @@ an estimate of the total number of RCU callbacks queued across all CPUs
|
||||
(625 in this case).
|
||||
|
||||
In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed
|
||||
for each CPU:
|
||||
for each CPU::
|
||||
|
||||
0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1
|
||||
|
||||
@@ -235,7 +253,7 @@ processing is enabled.
|
||||
|
||||
If the grace period ends just as the stall warning starts printing,
|
||||
there will be a spurious stall-warning message, which will include
|
||||
the following:
|
||||
the following::
|
||||
|
||||
INFO: Stall ended before state dump start
|
||||
|
||||
@@ -248,7 +266,7 @@ which is overkill for this sort of problem.
|
||||
|
||||
If all CPUs and tasks have passed through quiescent states, but the
|
||||
grace period has nevertheless failed to end, the stall-warning splat
|
||||
will include something like the following:
|
||||
will include something like the following::
|
||||
|
||||
All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0
|
||||
|
||||
@@ -261,7 +279,7 @@ which is way less than 23807. Finally, the root rcu_node structure's
|
||||
|
||||
If the relevant grace-period kthread has been unable to run prior to
|
||||
the stall warning, as was the case in the "All QSes seen" line above,
|
||||
the following additional line is printed:
|
||||
the following additional line is printed::
|
||||
|
||||
kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5
|
||||
|
||||
@@ -276,6 +294,7 @@ kthread last ran on CPU 5.
|
||||
|
||||
|
||||
Multiple Warnings From One Stall
|
||||
================================
|
||||
|
||||
If a stall lasts long enough, multiple stall-warning messages will be
|
||||
printed for it. The second and subsequent messages are printed at
|
||||
@@ -285,9 +304,10 @@ of the stall and the first message.
|
||||
|
||||
|
||||
Stall Warnings for Expedited Grace Periods
|
||||
==========================================
|
||||
|
||||
If an expedited grace period detects a stall, it will place a message
|
||||
like the following in dmesg:
|
||||
like the following in dmesg::
|
||||
|
||||
INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 21119 jiffies s: 73 root: 0x2/.
|
||||
|
@@ -1,7 +1,12 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
RCU Torture Test Operation
|
||||
==========================
|
||||
|
||||
|
||||
CONFIG_RCU_TORTURE_TEST
|
||||
=======================
|
||||
|
||||
The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
|
||||
implementations. It creates an rcutorture kernel module that can
|
||||
@@ -13,9 +18,10 @@ when the module is loaded, and stops when the module is unloaded.
|
||||
Module parameters are prefixed by "rcutorture." in
|
||||
Documentation/admin-guide/kernel-parameters.txt.
|
||||
|
||||
OUTPUT
|
||||
Output
|
||||
======
|
||||
|
||||
The statistics output is as follows:
|
||||
The statistics output is as follows::
|
||||
|
||||
rcu-torture:--- Start of test: nreaders=16 nfakewriters=4 stat_interval=30 verbose=0 test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4
|
||||
rcu-torture: rtc: (null) ver: 155441 tfle: 0 rta: 155441 rtaf: 8884 rtf: 155440 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 3055767
|
||||
@@ -36,53 +42,53 @@ automatic determination as to whether RCU operated correctly.
|
||||
|
||||
The entries are as follows:
|
||||
|
||||
o "rtc": The hexadecimal address of the structure currently visible
|
||||
* "rtc": The hexadecimal address of the structure currently visible
|
||||
to readers.
|
||||
|
||||
o "ver": The number of times since boot that the RCU writer task
|
||||
* "ver": The number of times since boot that the RCU writer task
|
||||
has changed the structure visible to readers.
|
||||
|
||||
o "tfle": If non-zero, indicates that the "torture freelist"
|
||||
* "tfle": If non-zero, indicates that the "torture freelist"
|
||||
containing structures to be placed into the "rtc" area is empty.
|
||||
This condition is important, since it can fool you into thinking
|
||||
that RCU is working when it is not. :-/
|
||||
|
||||
o "rta": Number of structures allocated from the torture freelist.
|
||||
* "rta": Number of structures allocated from the torture freelist.
|
||||
|
||||
o "rtaf": Number of allocations from the torture freelist that have
|
||||
* "rtaf": Number of allocations from the torture freelist that have
|
||||
failed due to the list being empty. It is not unusual for this
|
||||
to be non-zero, but it is bad for it to be a large fraction of
|
||||
the value indicated by "rta".
|
||||
|
||||
o "rtf": Number of frees into the torture freelist.
|
||||
* "rtf": Number of frees into the torture freelist.
|
||||
|
||||
o "rtmbe": A non-zero value indicates that rcutorture believes that
|
||||
* "rtmbe": A non-zero value indicates that rcutorture believes that
|
||||
rcu_assign_pointer() and rcu_dereference() are not working
|
||||
correctly. This value should be zero.
|
||||
|
||||
o "rtbe": A non-zero value indicates that one of the rcu_barrier()
|
||||
* "rtbe": A non-zero value indicates that one of the rcu_barrier()
|
||||
family of functions is not working correctly.
|
||||
|
||||
o "rtbke": rcutorture was unable to create the real-time kthreads
|
||||
* "rtbke": rcutorture was unable to create the real-time kthreads
|
||||
used to force RCU priority inversion. This value should be zero.
|
||||
|
||||
o "rtbre": Although rcutorture successfully created the kthreads
|
||||
* "rtbre": Although rcutorture successfully created the kthreads
|
||||
used to force RCU priority inversion, it was unable to set them
|
||||
to the real-time priority level of 1. This value should be zero.
|
||||
|
||||
o "rtbf": The number of times that RCU priority boosting failed
|
||||
* "rtbf": The number of times that RCU priority boosting failed
|
||||
to resolve RCU priority inversion.
|
||||
|
||||
o "rtb": The number of times that rcutorture attempted to force
|
||||
* "rtb": The number of times that rcutorture attempted to force
|
||||
an RCU priority inversion condition. If you are testing RCU
|
||||
priority boosting via the "test_boost" module parameter, this
|
||||
value should be non-zero.
|
||||
|
||||
o "nt": The number of times rcutorture ran RCU read-side code from
|
||||
* "nt": The number of times rcutorture ran RCU read-side code from
|
||||
within a timer handler. This value should be non-zero only
|
||||
if you specified the "irqreader" module parameter.
|
||||
|
||||
o "Reader Pipe": Histogram of "ages" of structures seen by readers.
|
||||
* "Reader Pipe": Histogram of "ages" of structures seen by readers.
|
||||
If any entries past the first two are non-zero, RCU is broken.
|
||||
And rcutorture prints the error flag string "!!!" to make sure
|
||||
you notice. The age of a newly allocated structure is zero,
|
||||
@@ -94,14 +100,14 @@ o "Reader Pipe": Histogram of "ages" of structures seen by readers.
|
||||
RCU. If you want to see what it looks like when broken, break
|
||||
it yourself. ;-)
|
||||
|
||||
o "Reader Batch": Another histogram of "ages" of structures seen
|
||||
* "Reader Batch": Another histogram of "ages" of structures seen
|
||||
by readers, but in terms of counter flips (or batches) rather
|
||||
than in terms of grace periods. The legal number of non-zero
|
||||
entries is again two. The reason for this separate view is that
|
||||
it is sometimes easier to get the third entry to show up in the
|
||||
"Reader Batch" list than in the "Reader Pipe" list.
|
||||
|
||||
o "Free-Block Circulation": Shows the number of torture structures
|
||||
* "Free-Block Circulation": Shows the number of torture structures
|
||||
that have reached a given point in the pipeline. The first element
|
||||
should closely correspond to the number of structures allocated,
|
||||
the second to the number that have been removed from reader view,
|
||||
@@ -112,7 +118,7 @@ o "Free-Block Circulation": Shows the number of torture structures
|
||||
|
||||
Different implementations of RCU can provide implementation-specific
|
||||
additional information. For example, Tree SRCU provides the following
|
||||
additional line:
|
||||
additional line::
|
||||
|
||||
srcud-torture: Tree SRCU per-CPU(idx=0): 0(35,-21) 1(-4,24) 2(1,1) 3(-26,20) 4(28,-47) 5(-9,4) 6(-10,14) 7(-14,11) T(1,6)
|
||||
|
||||
@@ -123,15 +129,15 @@ using a dynamically allocated srcu_struct (hence "srcud-" rather than
|
||||
"old" and "current" values to the underlying array, and is useful for
|
||||
debugging. The final "T" entry contains the totals of the counters.
|
||||
|
||||
|
||||
USAGE ON SPECIFIC KERNEL BUILDS
|
||||
Usage on Specific Kernel Builds
|
||||
===============================
|
||||
|
||||
It is sometimes desirable to torture RCU on a specific kernel build,
|
||||
for example, when preparing to put that kernel build into production.
|
||||
In that case, the kernel should be built with CONFIG_RCU_TORTURE_TEST=m
|
||||
so that the test can be started using modprobe and terminated using rmmod.
|
||||
|
||||
For example, the following script may be used to torture RCU:
|
||||
For example, the following script may be used to torture RCU::
|
||||
|
||||
#!/bin/sh
|
||||
|
||||
@@ -148,7 +154,8 @@ two are self-explanatory, while the last indicates that while there
|
||||
were no RCU failures, CPU-hotplug problems were detected.
|
||||
|
||||
|
||||
USAGE ON MAINLINE KERNELS
|
||||
Usage on Mainline Kernels
|
||||
=========================
|
||||
|
||||
When using rcutorture to test changes to RCU itself, it is often
|
||||
necessary to build a number of kernels in order to test that change
|
||||
@@ -180,16 +187,16 @@ to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the
|
||||
--configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'".
|
||||
Large systems can run multiple copies of of the full set of scenarios,
|
||||
for example, a system with 448 hardware threads can run five instances
|
||||
of the full set concurrently. To make this happen:
|
||||
of the full set concurrently. To make this happen::
|
||||
|
||||
kvm.sh --cpus 448 --configs '5*CFLIST'
|
||||
|
||||
Alternatively, such a system can run 56 concurrent instances of a single
|
||||
eight-CPU scenario:
|
||||
eight-CPU scenario::
|
||||
|
||||
kvm.sh --cpus 448 --configs '56*TREE04'
|
||||
|
||||
Or 28 concurrent instances of each of two eight-CPU scenarios:
|
||||
Or 28 concurrent instances of each of two eight-CPU scenarios::
|
||||
|
||||
kvm.sh --cpus 448 --configs '28*TREE03 28*TREE04'
|
||||
|
||||
@@ -199,14 +206,14 @@ values for memory may require disabling the callback-flooding tests
|
||||
using the --bootargs parameter discussed below.
|
||||
|
||||
Sometimes additional debugging is useful, and in such cases the --kconfig
|
||||
parameter to kvm.sh may be used, for example, "--kconfig 'CONFIG_KASAN=y'".
|
||||
parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``.
|
||||
|
||||
Kernel boot arguments can also be supplied, for example, to control
|
||||
rcutorture's module parameters. For example, to test a change to RCU's
|
||||
CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'".
|
||||
This will of course result in the scripting reporting a failure, namely
|
||||
the resuling RCU CPU stall warning. As noted above, reducing memory may
|
||||
require disabling rcutorture's callback-flooding tests:
|
||||
require disabling rcutorture's callback-flooding tests::
|
||||
|
||||
kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \
|
||||
--bootargs 'rcutorture.fwd_progress=0'
|
||||
@@ -225,7 +232,7 @@ is listed at the end of the kvm.sh output, which you really should redirect
|
||||
to a file. The build products and console output of each run is kept in
|
||||
tools/testing/selftests/rcutorture/res in timestamped directories. A
|
||||
given directory can be supplied to kvm-find-errors.sh in order to have
|
||||
it cycle you through summaries of errors and full error logs. For example:
|
||||
it cycle you through summaries of errors and full error logs. For example::
|
||||
|
||||
tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \
|
||||
tools/testing/selftests/rcutorture/res/2020.01.20-15.54.23
|
||||
@@ -245,38 +252,42 @@ that was tested and any uncommitted changes in diff format.
|
||||
|
||||
The most frequently used files in each per-scenario-run directory are:
|
||||
|
||||
.config: This file contains the Kconfig options.
|
||||
.config:
|
||||
This file contains the Kconfig options.
|
||||
|
||||
Make.out: This contains build output for a specific scenario.
|
||||
Make.out:
|
||||
This contains build output for a specific scenario.
|
||||
|
||||
console.log: This contains the console output for a specific scenario.
|
||||
console.log:
|
||||
This contains the console output for a specific scenario.
|
||||
This file may be examined once the kernel has booted, but
|
||||
it might not exist if the build failed.
|
||||
|
||||
vmlinux: This contains the kernel, which can be useful with tools like
|
||||
vmlinux:
|
||||
This contains the kernel, which can be useful with tools like
|
||||
objdump and gdb.
|
||||
|
||||
A number of additional files are available, but are less frequently used.
|
||||
Many are intended for debugging of rcutorture itself or of its scripting.
|
||||
|
||||
As of v5.4, a successful run with the default set of scenarios produces
|
||||
the following summary at the end of the run on a 12-CPU system:
|
||||
the following summary at the end of the run on a 12-CPU system::
|
||||
|
||||
SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ]
|
||||
SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ]
|
||||
SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ]
|
||||
SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ]
|
||||
TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ]
|
||||
TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ]
|
||||
TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ]
|
||||
TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198
|
||||
TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631
|
||||
TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ]
|
||||
TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844
|
||||
TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497
|
||||
CPU count limited from 16 to 12
|
||||
TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961
|
||||
TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997
|
||||
TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
|
||||
CPU count limited from 16 to 12
|
||||
TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011
|
||||
SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ]
|
||||
SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ]
|
||||
SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ]
|
||||
SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ]
|
||||
TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ]
|
||||
TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ]
|
||||
TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ]
|
||||
TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198
|
||||
TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631
|
||||
TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ]
|
||||
TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844
|
||||
TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497
|
||||
CPU count limited from 16 to 12
|
||||
TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961
|
||||
TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997
|
||||
TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
|
||||
CPU count limited from 16 to 12
|
||||
TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011
|
@@ -4038,6 +4038,14 @@
|
||||
latencies, which will choose a value aligned
|
||||
with the appropriate hardware boundaries.
|
||||
|
||||
rcutree.rcu_min_cached_objs= [KNL]
|
||||
Minimum number of objects which are cached and
|
||||
maintained per one CPU. Object size is equal
|
||||
to PAGE_SIZE. The cache allows to reduce the
|
||||
pressure to page allocator, also it makes the
|
||||
whole algorithm to behave better in low memory
|
||||
condition.
|
||||
|
||||
rcutree.jiffies_till_first_fqs= [KNL]
|
||||
Set delay from grace-period initialization to
|
||||
first attempt to force quiescent states.
|
||||
@@ -4258,6 +4266,20 @@
|
||||
Set time (jiffies) between CPU-hotplug operations,
|
||||
or zero to disable CPU-hotplug testing.
|
||||
|
||||
rcutorture.read_exit= [KNL]
|
||||
Set the number of read-then-exit kthreads used
|
||||
to test the interaction of RCU updaters and
|
||||
task-exit processing.
|
||||
|
||||
rcutorture.read_exit_burst= [KNL]
|
||||
The number of times in a given read-then-exit
|
||||
episode that a set of read-then-exit kthreads
|
||||
is spawned.
|
||||
|
||||
rcutorture.read_exit_delay= [KNL]
|
||||
The delay, in seconds, between successive
|
||||
read-then-exit testing episodes.
|
||||
|
||||
rcutorture.shuffle_interval= [KNL]
|
||||
Set task-shuffle interval (s). Shuffling tasks
|
||||
allows some CPUs to go into dyntick-idle mode
|
||||
@@ -4407,6 +4429,45 @@
|
||||
reboot_cpu is s[mp]#### with #### being the processor
|
||||
to be used for rebooting.
|
||||
|
||||
refscale.holdoff= [KNL]
|
||||
Set test-start holdoff period. The purpose of
|
||||
this parameter is to delay the start of the
|
||||
test until boot completes in order to avoid
|
||||
interference.
|
||||
|
||||
refscale.loops= [KNL]
|
||||
Set the number of loops over the synchronization
|
||||
primitive under test. Increasing this number
|
||||
reduces noise due to loop start/end overhead,
|
||||
but the default has already reduced the per-pass
|
||||
noise to a handful of picoseconds on ca. 2020
|
||||
x86 laptops.
|
||||
|
||||
refscale.nreaders= [KNL]
|
||||
Set number of readers. The default value of -1
|
||||
selects N, where N is roughly 75% of the number
|
||||
of CPUs. A value of zero is an interesting choice.
|
||||
|
||||
refscale.nruns= [KNL]
|
||||
Set number of runs, each of which is dumped onto
|
||||
the console log.
|
||||
|
||||
refscale.readdelay= [KNL]
|
||||
Set the read-side critical-section duration,
|
||||
measured in microseconds.
|
||||
|
||||
refscale.scale_type= [KNL]
|
||||
Specify the read-protection implementation to test.
|
||||
|
||||
refscale.shutdown= [KNL]
|
||||
Shut down the system at the end of the performance
|
||||
test. This defaults to 1 (shut it down) when
|
||||
rcuperf is built into the kernel and to 0 (leave
|
||||
it running) when rcuperf is built as a module.
|
||||
|
||||
refscale.verbose= [KNL]
|
||||
Enable additional printk() statements.
|
||||
|
||||
relax_domain_level=
|
||||
[KNL, SMP] Set scheduler's default relax_domain_level.
|
||||
See Documentation/admin-guide/cgroup-v1/cpusets.rst.
|
||||
@@ -5082,6 +5143,13 @@
|
||||
Prevent the CPU-hotplug component of torturing
|
||||
until after init has spawned.
|
||||
|
||||
torture.ftrace_dump_at_shutdown= [KNL]
|
||||
Dump the ftrace buffer at torture-test shutdown,
|
||||
even if there were no errors. This can be a
|
||||
very costly operation when many torture tests
|
||||
are running concurrently, especially on systems
|
||||
with rotating-rust storage.
|
||||
|
||||
tp720= [HW,PS2]
|
||||
|
||||
tpm_suspend_pcr=[HW,TPM]
|
||||
|
@@ -166,4 +166,4 @@ checked for such errors. The "rmmod" command forces a "SUCCESS",
|
||||
two are self-explanatory, while the last indicates that while there
|
||||
were no locking failures, CPU-hotplug problems were detected.
|
||||
|
||||
Also see: Documentation/RCU/torture.txt
|
||||
Also see: Documentation/RCU/torture.rst
|
||||
|
Reference in New Issue
Block a user