- This document gives an overview of the categories of memory-ordering
- operations provided by the Linux-kernel memory model (LKMM).
- Categories of Ordering
- ======================
- This section lists LKMM's three top-level categories of memory-ordering
- operations in decreasing order of strength:
- 1. Barriers (also known as "fences"). A barrier orders some or
- all of the CPU's prior operations against some or all of its
- subsequent operations.
- 2. Ordered memory accesses. These operations order themselves
- against some or all of the CPU's prior accesses or some or all
- of the CPU's subsequent accesses, depending on the subcategory
- of the operation.
- 3. Unordered accesses, as the name indicates, have no ordering
- properties except to the extent that they interact with an
- operation in the previous categories. This being the real world,
- some of these "unordered" operations provide limited ordering
- in some special situations.
- Each of the above categories is described in more detail by one of the
- following sections.
- Barriers
- ========
- Each of the following categories of barriers is described in its own
- subsection below:
- a. Full memory barriers.
- b. Read-modify-write (RMW) ordering augmentation barriers.
- c. Write memory barrier.
- d. Read memory barrier.
- e. Compiler barrier.
- Note well that many of these primitives generate absolutely no code
- in kernels built with CONFIG_SMP=n. Therefore, if you are writing
- a device driver, which must correctly order accesses to a physical
- device even in kernels built with CONFIG_SMP=n, please use the
- ordering primitives provided for that purpose. For example, instead of
- smp_mb(), use mb(). See the "Linux Kernel Device Drivers" book or the
- https://lwn.net/Articles/698014/ article for more information.
- Full Memory Barriers
- --------------------
- The Linux-kernel primitives that provide full ordering include:
- o The smp_mb() full memory barrier.
- o Value-returning RMW atomic operations whose names do not end in
- _acquire, _release, or _relaxed.
- o RCU's grace-period primitives.
- First, the smp_mb() full memory barrier orders all of the CPU's prior
- accesses against all subsequent accesses from the viewpoint of all CPUs.
- In other words, all CPUs will agree that any earlier action taken
- by that CPU happened before any later action taken by that same CPU.
- For example, consider the following:
-         WRITE_ONCE(x, 1);
-         smp_mb(); // Order store to x before load from y.
-         r1 = READ_ONCE(y);
- All CPUs will agree that the store to "x" happened before the load
- from "y", as indicated by the comment. And yes, please comment your
- memory-ordering primitives. It is surprisingly hard to remember their
- purpose after even a few months.
- Second, some RMW atomic operations provide full ordering. These
- operations include value-returning RMW atomic operations (that is, those
- with non-void return types) whose names do not end in _acquire, _release,
- or _relaxed. Examples include atomic_add_return(), atomic_dec_and_test(),
- cmpxchg(), and xchg(). Note that conditional RMW atomic operations such
- as cmpxchg() are only guaranteed to provide ordering when they succeed.
- When RMW atomic operations provide full ordering, they partition the
- CPU's accesses into three groups:
- 1. All code that executed prior to the RMW atomic operation.
- 2. The RMW atomic operation itself.
- 3. All code that executed after the RMW atomic operation.
- All CPUs will agree that any operation in a given partition happened
- before any operation in a higher-numbered partition.
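- For example, in the following sketch (with illustrative variable
- names, and assuming "my_counter" is an atomic_t), the fully ordered
- atomic_add_return() guarantees that all CPUs see the store to "x" as
- happening before the load from "y":
-         WRITE_ONCE(x, 1);
-         r0 = atomic_add_return(1, &my_counter); // Fully ordered RMW.
-         r1 = READ_ONCE(y); // All CPUs see this after the store to x.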
- In contrast, non-value-returning RMW atomic operations (that is, those
- with void return types) do not guarantee any ordering whatsoever. Nor do
- value-returning RMW atomic operations whose names end in _relaxed.
- Examples of the former include atomic_inc() and atomic_dec(),
- while examples of the latter include atomic_cmpxchg_relaxed() and
- atomic_xchg_relaxed(). Similarly, value-returning non-RMW atomic
- operations such as atomic_read() do not guarantee full ordering, and
- are covered in the later section on unordered operations.
- Value-returning RMW atomic operations whose names end in _acquire or
- _release provide limited ordering, and will be described later in this
- document.
- Finally, RCU's grace-period primitives provide full ordering. These
- primitives include synchronize_rcu(), synchronize_rcu_expedited(),
- synchronize_srcu() and so on. However, these primitives have orders
- of magnitude greater overhead than smp_mb(), atomic_xchg(), and so on.
- Furthermore, RCU's grace-period primitives can only be invoked in
- sleepable contexts. Therefore, RCU's grace-period primitives are
- typically instead used to provide ordering against RCU read-side critical
- sections, as documented in their comment headers. But of course if you
- need a synchronize_rcu() to interact with readers, it costs you nothing
- to also rely on its additional full-memory-barrier semantics. Just please
- carefully comment this, otherwise your future self will hate you.
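- For example, here is a minimal sketch of the common update-side
- pattern, assuming that "gp" is an RCU-protected pointer to dynamically
- allocated memory and that updaters are serialized by some other means
- (the names are illustrative):
-         struct foo *p = gp; // Updaters are serialized, so a plain load suffices.
-         rcu_assign_pointer(gp, NULL); // Unpublish the structure.
-         synchronize_rcu(); // Wait for pre-existing readers; also a full barrier.
-         kfree(p); // No reader can still be referencing *p.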
- RMW Ordering Augmentation Barriers
- ----------------------------------
- As noted in the previous section, non-value-returning RMW operations
- such as atomic_inc() and atomic_dec() guarantee no ordering whatsoever.
- Nevertheless, a number of popular CPU families, including x86, provide
- full ordering for these primitives. One way to obtain full ordering on
- all architectures is to add a call to smp_mb():
-         WRITE_ONCE(x, 1);
-         atomic_inc(&my_counter);
-         smp_mb(); // Inefficient on x86!!!
-         r1 = READ_ONCE(y);
- This works, but the smp_mb() adds needless overhead on x86, where
- atomic_inc() provides full ordering all by itself.
- The smp_mb__after_atomic() primitive can be used instead:
-         WRITE_ONCE(x, 1);
-         atomic_inc(&my_counter);
-         smp_mb__after_atomic(); // Order store to x before load from y.
-         r1 = READ_ONCE(y);
- The smp_mb__after_atomic() primitive emits code only on CPUs whose
- atomic_inc() implementations do not guarantee full ordering, thus
- incurring no unnecessary overhead on x86. There are a number of
- variations on the smp_mb__*() theme:
- o smp_mb__before_atomic(), which provides full ordering prior
- to an unordered RMW atomic operation.
- o smp_mb__after_atomic(), which, as shown above, provides full
- ordering subsequent to an unordered RMW atomic operation.
- o smp_mb__after_spinlock(), which provides full ordering subsequent
- to a successful spinlock acquisition. Note that spin_lock() is
- always successful but spin_trylock() might not be.
- o smp_mb__after_srcu_read_unlock(), which provides full ordering
- subsequent to an srcu_read_unlock().
- It is bad practice to place code between the smp_mb__*() primitive and
- the operation whose ordering it is augmenting. The reason is that the
- ordering of this intervening code will differ from one CPU architecture
- to another.
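- For example, here is a sketch of the before-variant, paralleling the
- earlier smp_mb__after_atomic() example and keeping the barrier
- immediately adjacent to the atomic operation that it augments:
-         WRITE_ONCE(x, 1);
-         smp_mb__before_atomic(); // With atomic_inc(), acts as a full barrier.
-         atomic_inc(&my_counter);
-         r1 = READ_ONCE(y);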
- Write Memory Barrier
- --------------------
- The Linux kernel's write memory barrier is smp_wmb(). If a CPU executes
- the following code:
-         WRITE_ONCE(x, 1);
-         smp_wmb();
-         WRITE_ONCE(y, 1);
- Then any given CPU will see the write to "x" as having happened before
- the write to "y". However, you are usually better off using a release
- store, as described in the "Release Operations" section below.
- Note that smp_wmb() might fail to provide ordering for unmarked C-language
- stores because profile-driven optimization could determine that the
- value being overwritten is almost always equal to the new value. Such a
- compiler might then reasonably decide to transform "x = 1" and "y = 1"
- as follows:
-         if (x != 1)
-                 x = 1;
-         smp_wmb(); // BUG: does not order the reads!!!
-         if (y != 1)
-                 y = 1;
- Therefore, if you need to use smp_wmb() with unmarked C-language writes,
- you will need to make sure that none of the compilers used to build
- the Linux kernel carry out this sort of transformation, both now and in
- the future.
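- Alternatively, marking the stores prevents this particular
- transformation, because the compiler is not permitted to invent loads
- from, nor elide stores to, the volatile accesses that WRITE_ONCE()
- expands to:
-         WRITE_ONCE(x, 1);
-         smp_wmb(); // Order marked store to x before marked store to y.
-         WRITE_ONCE(y, 1);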
- Read Memory Barrier
- -------------------
- The Linux kernel's read memory barrier is smp_rmb(). If a CPU executes
- the following code:
-         r0 = READ_ONCE(y);
-         smp_rmb();
-         r1 = READ_ONCE(x);
- Then any given CPU will see the read from "y" as having preceded the read from
- "x". However, you are usually better off using an acquire load, as described
- in the "Acquire Operations" section below.
- Compiler Barrier
- ----------------
- The Linux kernel's compiler barrier is barrier(). This primitive
- prohibits compiler code-motion optimizations that might move memory
- references across the point in the code containing the barrier(), but
- does not constrain hardware memory ordering. For example, this can be
- used to prevent the compiler from moving code across an infinite loop:
-         WRITE_ONCE(x, 1);
-         while (dontstop)
-                 barrier();
-         r1 = READ_ONCE(y);
- Without the barrier(), the compiler would be within its rights to move the
- WRITE_ONCE() to follow the loop. This code motion could be problematic
- in the case where an interrupt handler terminates the loop. Another way
- to handle this is to use READ_ONCE() for the load of "dontstop".
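- For example, here is a sketch of that READ_ONCE()-based alternative:
-         WRITE_ONCE(x, 1);
-         while (READ_ONCE(dontstop))
-                 continue; // Volatile load, so re-fetched on each iteration.
-         r1 = READ_ONCE(y);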
- Note that the barriers discussed previously use barrier() or its low-level
- equivalent in their implementations.
- Ordered Memory Accesses
- =======================
- The Linux kernel provides a wide variety of ordered memory accesses:
- a. Release operations.
- b. Acquire operations.
- c. RCU read-side ordering.
- d. Control dependencies.
- Each of the above categories has its own section below.
- Release Operations
- ------------------
- Release operations include smp_store_release(), atomic_set_release(),
- rcu_assign_pointer(), and value-returning RMW operations whose names
- end in _release. These operations order their own store against all
- of the CPU's prior memory accesses. Release operations often provide
- improved readability and performance compared to explicit barriers.
- For example, use of smp_store_release() saves a line compared to the
- smp_wmb() example above:
-         WRITE_ONCE(x, 1);
-         smp_store_release(&y, 1);
- More important, smp_store_release() makes it easier to connect up the
- different pieces of the concurrent algorithm. The variable stored to
- by the smp_store_release(), in this case "y", will normally be used in
- an acquire operation in other parts of the concurrent algorithm.
- To see the performance advantages, suppose that the above example read
- from "x" instead of writing to it. Then an smp_wmb() could not guarantee
- ordering, and an smp_mb() would be needed instead:
-         r1 = READ_ONCE(x);
-         smp_mb();
-         WRITE_ONCE(y, 1);
- But smp_mb() often incurs much higher overhead than does
- smp_store_release(), which still provides the needed ordering of "x"
- against "y". On x86, the version using smp_store_release() might compile
- to a simple load instruction followed by a simple store instruction.
- In contrast, the smp_mb() compiles to an expensive instruction that
- provides the needed ordering.
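- In contrast, a release store provides the needed ordering of the load
- from "x" against the store to "y" without a separate full barrier:
-         r1 = READ_ONCE(x);
-         smp_store_release(&y, 1); // Orders the load from x before this store.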
- There is a wide variety of release operations:
- o Store operations, including not only the aforementioned
- smp_store_release(), but also atomic_set_release(), and
- atomic_long_set_release().
- o RCU's rcu_assign_pointer() operation. This is the same as
- smp_store_release() except that: (1) It takes the pointer to
- be assigned to instead of a pointer to that pointer, (2) It
- is intended to be used in conjunction with rcu_dereference()
- and similar rather than smp_load_acquire(), and (3) It checks
- for an RCU-protected pointer in "sparse" runs. (A publication
- sketch appears after this list.)
- o Value-returning RMW operations whose names end in _release,
- such as atomic_fetch_add_release() and cmpxchg_release().
- Note that release ordering is guaranteed only against the
- memory-store portion of the RMW operation, and not against the
- memory-load portion. Note also that conditional operations such
- as cmpxchg_release() are only guaranteed to provide ordering
- when they succeed.
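- Here is the publication sketch promised above, assuming that "gp" is
- an RCU-protected pointer and using an illustrative structure type and
- field name:
-         struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);
-         if (p) {
-                 p->a = 42; // Initialization is ordered before...
-                 rcu_assign_pointer(gp, p); // ...publication of the pointer.
-         }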
- As mentioned earlier, release operations are often paired with acquire
- operations, which are the subject of the next section.
- Acquire Operations
- ------------------
- Acquire operations include smp_load_acquire(), atomic_read_acquire(),
- and value-returning RMW operations whose names end in _acquire. These
- operations order their own load against all of the CPU's subsequent
- memory accesses. Acquire operations often provide improved performance
- and readability compared to explicit barriers. For example, use of
- smp_load_acquire() saves a line compared to the smp_rmb() example above:
-         r0 = smp_load_acquire(&y);
-         r1 = READ_ONCE(x);
- As with smp_store_release(), this also makes it easier to connect
- the different pieces of the concurrent algorithm by looking for the
- smp_store_release() that stores to "y". In addition, smp_load_acquire()
- improves upon smp_rmb() by ordering against subsequent stores as well
- as against subsequent loads.
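- For example, in the following sketch, the acquire load is ordered not
- only before the later load from "x", but also before the later store
- to "z", which smp_rmb() would not guarantee ("z" is illustrative):
-         r0 = smp_load_acquire(&y);
-         r1 = READ_ONCE(x); // Ordered after the acquire load.
-         WRITE_ONCE(z, 1); // Also ordered after the acquire load.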
- There are a couple of categories of acquire operations:
- o Load operations, including not only the aforementioned
- smp_load_acquire(), but also atomic_read_acquire(), and
- atomic64_read_acquire().
- o Value-returning RMW operations whose names end in _acquire,
- such as atomic_xchg_acquire() and atomic_cmpxchg_acquire().
- Note that acquire ordering is guaranteed only against the
- memory-load portion of the RMW operation, and not against the
- memory-store portion. Note also that conditional operations
- such as atomic_cmpxchg_acquire() are only guaranteed to provide
- ordering when they succeed.
- Symmetry being what it is, acquire operations are often paired with the
- release operations covered earlier. For example, consider the
- following, where task0() and task1() execute concurrently:
-         void task0(void)
-         {
-                 WRITE_ONCE(x, 1);
-                 smp_store_release(&y, 1);
-         }
-         void task1(void)
-         {
-                 r0 = smp_load_acquire(&y);
-                 r1 = READ_ONCE(x);
-         }
- If "x" and "y" are both initially zero, then either r0's final value
- will be zero or r1's final value will be one, thus providing the required
- ordering.
- RCU Read-Side Ordering
- ----------------------
- This category includes read-side markers such as rcu_read_lock()
- and rcu_read_unlock() as well as pointer-traversal primitives such as
- rcu_dereference() and srcu_dereference().
- Compared to locking primitives and RMW atomic operations, markers
- for RCU read-side critical sections incur very low overhead because
- they interact only with the corresponding grace-period primitives.
- For example, the rcu_read_lock() and rcu_read_unlock() markers interact
- with synchronize_rcu(), synchronize_rcu_expedited(), and call_rcu().
- The way this works is that if a given call to synchronize_rcu() cannot
- prove that it started before a given call to rcu_read_lock(), then
- that synchronize_rcu() must block until the matching rcu_read_unlock()
- is reached. For more information, please see the synchronize_rcu()
- docbook header comment and the material in Documentation/RCU.
- RCU's pointer-traversal primitives, including rcu_dereference() and
- srcu_dereference(), order their load (which must be a pointer) against any
- of the CPU's subsequent memory accesses whose address has been calculated
- from the value loaded. There is said to be an *address dependency*
- from the value returned by the rcu_dereference() or srcu_dereference()
- to that subsequent memory access.
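- For example, in the following sketch, there is an address dependency
- from the rcu_dereference() to the load from p->a, so that load is
- ordered after the pointer load (the names are again illustrative):
-         rcu_read_lock();
-         p = rcu_dereference(gp);
-         if (p)
-                 r1 = p->a; // Address depends on the value of p loaded above.
-         rcu_read_unlock();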
- A call to rcu_dereference() for a given RCU-protected pointer is
- usually paired with a call to rcu_assign_pointer() for that
- same pointer in much the same way that a call to smp_load_acquire() is
- paired with a call to smp_store_release(). Calls to rcu_dereference()
- and rcu_assign_pointer() are often buried in other APIs, for example,
- the RCU list API members defined in include/linux/rculist.h. For more
- information, please see the docbook headers in that file, the most
- recent LWN article on the RCU API (https://lwn.net/Articles/777036/),
- and of course the material in Documentation/RCU.
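- For example, here is a sketch of a reader traversing an RCU-protected
- list, where the rcu_dereference() calls are buried inside
- list_for_each_entry_rcu() (the list head, field, and helper names are
- illustrative):
-         rcu_read_lock();
-         list_for_each_entry_rcu(p, &myhead, list)
-                 do_something_with(p->a); // Hypothetical helper.
-         rcu_read_unlock();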
- If the pointer value is manipulated between the rcu_dereference()
- that returned it and a later dereference of that pointer, please read
- Documentation/RCU/rcu_dereference.rst. It can also be quite helpful to
- review uses in the Linux kernel.
- Control Dependencies
- --------------------
- A control dependency extends from a marked load (READ_ONCE() or stronger)
- through an "if" condition to a marked store (WRITE_ONCE() or stronger)
- that is executed only by one of the legs of that "if" statement.
- Control dependencies are so named because they are mediated by
- control-flow instructions such as comparisons and conditional branches.
- In short, you can use a control dependency to enforce ordering between
- a READ_ONCE() and a WRITE_ONCE() when there is an "if" condition
- between them. The canonical example is as follows:
-         q = READ_ONCE(a);
-         if (q)
-                 WRITE_ONCE(b, 1);
- In this case, all CPUs would see the read from "a" as happening before
- the write to "b".
- However, control dependencies are easily destroyed by compiler
- optimizations, so any use of control dependencies must take into account
- all of the compilers used to build the Linux kernel. Please see the
- "control-dependencies.txt" file for more information.
- Unordered Accesses
- ==================
- Each of these two categories of unordered accesses has a section below:
- a. Unordered marked operations.
- b. Unmarked C-language accesses.
- Unordered Marked Operations
- ---------------------------
- Unordered operations to different variables are just that, unordered.
- However, if a group of CPUs apply these operations to a single variable,
- all the CPUs will agree on the operation order. Of course, the ordering
- of unordered marked accesses can also be constrained using the mechanisms
- described earlier in this document.
- These operations come in three categories:
- o Marked writes, such as WRITE_ONCE() and atomic_set(). These
- primitives require the compiler to emit the corresponding store
- instructions in the expected execution order, thus suppressing
- a number of destructive optimizations. However, they provide no
- hardware ordering guarantees, and in fact many CPUs will happily
- reorder marked writes with each other or with other unordered
- operations, unless these operations are to the same variable.
- o Marked reads, such as READ_ONCE() and atomic_read(). These
- primitives require the compiler to emit the corresponding load
- instructions in the expected execution order, thus suppressing
- a number of destructive optimizations. However, they provide no
- hardware ordering guarantees, and in fact many CPUs will happily
- reorder marked reads with each other or with other unordered
- operations, unless these operations are to the same variable.
- o Unordered RMW atomic operations. These are non-value-returning
- RMW atomic operations whose names do not end in _acquire or
- _release, and also value-returning RMW operations whose names
- end in _relaxed. Examples include atomic_add(), atomic_or(),
- and atomic64_fetch_xor_relaxed(). These operations do carry
- out the specified RMW operation atomically, for example, five
- concurrent atomic_inc() operations applied to a given variable
- will reliably increase the value of that variable by five.
- However, many CPUs will happily reorder these operations with
- each other or with other unordered operations.
- This category of operations can be efficiently ordered using
- smp_mb__before_atomic() and smp_mb__after_atomic(), as was
- discussed in the "RMW Ordering Augmentation Barriers" section.
- In short, these operations can be freely reordered unless they are all
- operating on a single variable or unless they are constrained by one of
- the operations called out earlier in this document.
- Unmarked C-Language Accesses
- ----------------------------
- Unmarked C-language accesses are normal variable accesses to normal
- variables, that is, to variables that are not "volatile" and are not
- C11 atomic variables. These operations provide no ordering guarantees,
- and further do not guarantee "atomic" access. For example, the compiler
- might (and sometimes does) split a plain C-language store into multiple
- smaller stores. A load from that same variable running on some other
- CPU while such a store is executing might see a value that is a mashup
- of the old value and the new value.
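- For example, given an unmarked store of a constant whose halves both
- fit into store-immediate instructions, the compiler might emit two
- narrower stores in place of one. A sketch, with the exact behavior
- depending on compiler and architecture:
-         x = 0x00010002; // Unmarked: might be torn into two 16-bit stores.
-         WRITE_ONCE(x, 0x00010002); // Marked: aligned word-sized store, not torn.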
- Unmarked C-language accesses are unordered, and are also subject to
- any number of compiler optimizations, many of which can break your
- concurrent code. It is possible to use unmarked C-language accesses for
- shared variables that are subject to concurrent access, but great care
- is required on an ongoing basis. The compiler-constraining barrier()
- primitive can be helpful, as can the various ordering primitives discussed
- in this document. It nevertheless bears repeating that use of unmarked
- C-language accesses requires careful attention to not just your code,
- but to all the compilers that might be used to build it. Such compilers
- might replace a series of loads with a single load, and might replace
- a series of stores with a single store. Some compilers will even split
- a single store into multiple smaller stores.
- But there are some ways of using unmarked C-language accesses for shared
- variables without such worries:
- o Guard all accesses to a given variable by a particular lock,
- so that there are never concurrent conflicting accesses to
- that variable, as sketched below this list. (There are "conflicting
- accesses" when (1) at least one of the concurrent accesses to a
- variable is an unmarked C-language access and (2) at least one of
- those accesses is a write, whether marked or not.)
- o As above, but using other synchronization primitives such
- as reader-writer locks or sequence locks.
- o Use locking or other means to ensure that all concurrent accesses
- to a given variable are reads.
- o Restrict use of a given variable to statistics or heuristics
- where the occasional bogus value can be tolerated.
- o Declare the accessed variables as C11 atomics.
-         https://lwn.net/Articles/691128/
- o Declare the accessed variables as "volatile".
- If you need to live more dangerously, please do take the time to
- understand the compilers. One place to start is these two LWN
- articles:
-         Who's afraid of a big bad optimizing compiler?
-         https://lwn.net/Articles/793253
-         Calibrating your fear of big bad optimizing compilers
-         https://lwn.net/Articles/799218
- Used properly, unmarked C-language accesses can reduce overhead on
- fastpaths. However, the price is great care and continual attention
- to your compiler as new versions come out and as new optimizations
- are enabled.