On atomic types (atomic_t, atomic64_t and atomic_long_t).

The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).

API
---

The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):

Non-RMW ops:

  atomic_read(), atomic_set()
  atomic_read_acquire(), atomic_set_release()

RMW atomic operations:

Arithmetic:

  atomic_{add,sub,inc,dec}()
  atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
  atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()

Bitwise:

  atomic_{and,or,xor,andnot}()
  atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()

Swap:

  atomic_xchg{,_relaxed,_acquire,_release}()
  atomic_cmpxchg{,_relaxed,_acquire,_release}()
  atomic_try_cmpxchg{,_relaxed,_acquire,_release}()

Reference count (but please see refcount_t):

  atomic_add_unless(), atomic_inc_not_zero()
  atomic_sub_and_test(), atomic_dec_and_test()

Misc:

  atomic_inc_and_test(), atomic_add_negative()
  atomic_dec_unless_positive(), atomic_inc_unless_negative()

Barriers:

  smp_mb__{before,after}_atomic()

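For illustration, the sketch below shows the shape of a few of these calls on
a hypothetical object; 'struct foo' and its helpers are made up for this
document, and a real reference count should use refcount_t:

  /* Illustration only: 'struct foo' and these helpers are hypothetical. */
  struct foo {
    atomic_t refs;
    atomic_t hits;
  };

  static void foo_init(struct foo *f)
  {
    atomic_set(&f->refs, 1);                /* non-RMW store */
    atomic_set(&f->hits, 0);
  }

  static void foo_hit(struct foo *f)
  {
    atomic_inc(&f->hits);                   /* RMW without return value */
  }

  static bool foo_put(struct foo *f)
  {
    return atomic_dec_and_test(&f->refs);   /* value-returning RMW, fully ordered */
  }
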
TYPES (signed vs unsigned)
-----

While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.

Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary and we can simply cast; there is no UB.

There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.

With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.

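For example, a cast is all that is needed to treat the value as unsigned
(a sketch; the wrap-around shown relies on the -fno-strict-overflow semantics
described above):

  atomic_t v = ATOMIC_INIT(INT_MAX);
  u32 uval;

  atomic_inc(&v);               /* wraps to INT_MIN; defined, not UB   */
  uval = (u32)atomic_read(&v);  /* 0x80000000 when viewed as unsigned  */
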
SEMANTICS
---------

Non-RMW ops:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.

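As a sketch of what that canonical implementation looks like (instrumentation
and arch_ wrappers omitted; see the atomic headers for the real thing):

  static inline int atomic_read(const atomic_t *v)
  {
    return READ_ONCE(v->counter);
  }

  static inline void atomic_set(atomic_t *v, int i)
  {
    WRITE_ONCE(v->counter, i);
  }

  static inline int atomic_read_acquire(const atomic_t *v)
  {
    return smp_load_acquire(&v->counter);
  }

  static inline void atomic_set_release(atomic_t *v, int i)
  {
    smp_store_release(&v->counter, i);
  }
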
A note for the implementation of atomic_set{}() is that it must not break the
atomicity of the RMW ops. That is:

  C Atomic-RMW-ops-are-atomic-WRT-atomic_set

  {
    atomic_t v = ATOMIC_INIT(1);
  }

  P0(atomic_t *v)
  {
    (void)atomic_add_unless(v, 1, 0);
  }

  P1(atomic_t *v)
  {
    atomic_set(v, 0);
  }

  exists
  (v=2)

In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.

This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.

The obvious case where this is not so is when we need to implement atomic ops
with a lock:

  CPU0                                          CPU1

  atomic_add_unless(v, 1, 0);
    lock();
    ret = READ_ONCE(v->counter); // == 1
                                                atomic_set(v, 0);
                                                  WRITE_ONCE(v->counter, 0);
    if (ret != u)
      WRITE_ONCE(v->counter, ret + 1);
    unlock();

the typical solution is to then implement atomic_set{}() with atomic_xchg().

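That is, on such a lock-based implementation atomic_set() is routed through
the locked RMW path; roughly (a sketch, not any particular architecture's
code):

  static void atomic_set(atomic_t *v, int i)
  {
    /* atomic_xchg() takes the same lock as the other RMW ops */
    (void)atomic_xchg(v, i);
  }
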
RMW ops:

These come in various forms:

 - plain operations without return value: atomic_{}()

 - operations which return the modified value: atomic_{}_return()

   these are limited to the arithmetic operations because those are
   reversible. Bitops are irreversible and therefore the modified value
   is of dubious utility.

 - operations which return the original value: atomic_fetch_{}()

 - swap operations: xchg(), cmpxchg() and try_cmpxchg()

 - misc; the special purpose operations that are commonly used and would,
   given the interface, normally be implemented using (try_)cmpxchg loops but
   are time critical and can, (typically) on LL/SC architectures, be more
   efficiently implemented.

All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.

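To make the return conventions concrete (a sketch; the values in the comments
follow from an initial value of 4):

  atomic_t v = ATOMIC_INIT(4);
  int new, old;

  atomic_add(2, &v);              /* no return value; v is now 6       */
  new = atomic_add_return(2, &v); /* returns the modified value: 8     */
  old = atomic_fetch_add(2, &v);  /* returns the original value: 8,
                                     v is now 10                       */
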
ORDERING (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.

Except of course when an operation has an explicit ordering like:

 {}_relaxed: unordered
 {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
 {}_release: the W of the RMW (or atomic_set) is a RELEASE

Where 'unordered' is against other memory locations. Address dependencies are
not defeated.

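For example, the classic message-passing pattern written with the explicit
_acquire/_release variants (a sketch; x is a plain int, f an atomic_t):

  CPU0                                          CPU1

  WRITE_ONCE(x, 1);
  atomic_set_release(&f, 1);
                                                r0 = atomic_read_acquire(&f);
                                                r1 = READ_ONCE(x);

If CPU1 observes r0 == 1, the RELEASE/ACQUIRE pair guarantees it also
observes r1 == 1.
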
Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.

The barriers:

  smp_mb__{before,after}_atomic()

only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.

These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
provide fully ordered atomics and these barriers are no-ops.

NOTE: when the atomic RMW ops are fully ordered, they should also imply a
compiler barrier.

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

However, the atomic_fetch_add() might be implemented more efficiently.

Further, while something like:

  smp_mb__before_atomic();
  atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:

  atomic_inc(&X);
  smp_mb__after_atomic();

is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:

  C Atomic-RMW+mb__after_atomic-is-stronger-than-acquire

  {
  }

  P0(int *x, atomic_t *y)
  {
    r0 = READ_ONCE(*x);
    smp_rmb();
    r1 = atomic_read(y);
  }

  P1(int *x, atomic_t *y)
  {
    atomic_inc(y);
    smp_mb__after_atomic();
    WRITE_ONCE(*x, 1);
  }

  exists
  (0:r0=1 /\ 0:r1=0)

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE. Thus:

  P0                    P1

                        t = LL.acq *y (0)
                        t++;
                        *x = 1;
  r0 = *x (1)
  RMB
  r1 = *y (0)
                        SC *y, t;

is allowed.

CMPXCHG vs TRY_CMPXCHG
----------------------

  int atomic_cmpxchg(atomic_t *ptr, int old, int new);
  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);

Both provide the same functionality, but try_cmpxchg() can lead to more
compact code. The functions relate like:

  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
  {
    int ret, old = *oldp;
    ret = atomic_cmpxchg(ptr, old, new);
    if (ret != old)
      *oldp = ret;
    return ret == old;
  }

and:

  int atomic_cmpxchg(atomic_t *ptr, int old, int new)
  {
    (void)atomic_try_cmpxchg(ptr, &old, new);
    return old;
  }

Usage:

  old = atomic_read(&v);                        old = atomic_read(&v);
  for (;;) {                                    do {
    new = func(old);                              new = func(old);
    tmp = atomic_cmpxchg(&v, old, new);         } while (!atomic_try_cmpxchg(&v, &old, new));
    if (tmp == old)
      break;
    old = tmp;
  }

NB. try_cmpxchg() also generates better code on some platforms (notably x86)
where the function more closely matches the hardware instruction.

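As a (hypothetical) example of the try_cmpxchg() style, a saturating increment
could look like the sketch below; the function name is made up for
illustration:

  static void my_inc_saturating(atomic_t *v)
  {
    int old = atomic_read(v);

    do {
      if (old == INT_MAX)       /* already saturated; do nothing */
        return;
      /* on failure, atomic_try_cmpxchg() updates 'old' with the current value */
    } while (!atomic_try_cmpxchg(v, &old, old + 1));
  }
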
FORWARD PROGRESS
----------------

In general, strong forward progress is expected of all unconditional atomic
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
a fair amount of code also requires forward progress from the conditional
atomic operations.

Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!atomic_try_cmpxchg(&v, &old, new));

which on LL/SC becomes something like:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!({
    asm volatile ("1: LL  %[oldval], %[v]\n"
                  "   CMP %[oldval], %[old]\n"
                  "   BNE 2f\n"
                  "   SC  %[new], %[v]\n"
                  "   BNE 1b\n"
                  "2:\n"
                  : [oldval] "=&r" (oldval), [v] "m" (v)
                  : [old] "r" (old), [new] "r" (new)
                  : "memory");
    success = (oldval == old);
    if (!success)
      old = oldval;
    success; }));

However, even the forward branch from the failed compare can cause the LL/SC
to fail on some architectures, let alone whatever the compiler makes of the C
loop body. As a result there is no guarantee whatsoever the cacheline
containing @v will stay on the local CPU and progress is made.

Even native CAS architectures can fail to provide forward progress for their
primitive (see Sparc64 for an example).

Such implementations are strongly encouraged to add exponential backoff loops
to a failed CAS in order to ensure some progress. Affected architectures are
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
their locking primitives.

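To illustrate the shape of such a backoff (a sketch only; cas(),
backoff_delay() and BACKOFF_MAX are placeholders, not kernel APIs):

  int delay = 1;

  old = atomic_read(&v);
  new = func(old);
  while (!cas(&v, old, new)) {   /* 'cas' stands in for the native CAS   */
    backoff_delay(delay);        /* hypothetical: spin for 'delay' units */
    if (delay < BACKOFF_MAX)
      delay <<= 1;               /* exponential backoff                  */
    old = atomic_read(&v);
    new = func(old);
  }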