On GitHub: jfbastien/no-sane-compiler
JF Bastien <@jfbastien>
Compilers do optimize atomics, memory accesses around atomics, and utilize architecture-specific knowledge. My hobby is to encourage compilers to do more of this, programmers to rely on it, and hardware vendors to give us new atomic toys to optimize with. Oh, and standardize yet more close-to-the-metal concurrency and parallelism tools.
But, you say, surely volatile always means volatile, there’s nothing wrong with my benign races, nothing could even go wrong with non-temporal accesses, and who needs 6 memory orderings anyways‽ I’m glad you asked, let me tell you about my hobby…
Threads didn't exist before C++11
Naïve model: memory accesses are simply interleaved
Hardware doesn't work that way
Overly restrictive
Limits compiler + HW reordering / optimizations; access granularity; want to reason about atomic code regions; fences cost 2 to 200 cycles depending on architecture.
Data race:
Data races are UB
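A minimal sketch of such a race (names hypothetical, for illustration only): two threads do an unsynchronized read-modify-write on the same plain int, so the program has undefined behavior.

#include <thread>

int counter = 0; // plain int: no synchronization

void racy() { ++counter; } // unsynchronized read-modify-write

int main() {
  std::thread t1(racy), t2(racy); // both threads touch counter: data race, UB
  t1.join();
  t2.join();
}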
You should use ThreadSanitizer.
1.10 Multi-threaded executions and data races
Atomic: indivisible with respect to all other atomic accesses to that object
No tearing, no duplication, no elision.
racing accesses
indivisible
enforce cache coherency
doesn't guarantee visibility
Location may be concurrently modified; refrain from optimizations that will break if it is. Cache coherence: single-variable ordering, implicit on most architectures, but requires ld.acq on Itanium.
“C++14 replaced the offensive wording with hand waving”
“it's probably incorrect if you care about the result”
— Hans Boehm
Paraphrasing Hans Boehm's talk yesterday.
template<class T>struct atomic;
template<>struct atomic<integral>;
template<class T>struct atomic<T*>;
template<class T>struct atomic<floating-point>; // future
All atomic operations are on atomic objects. For the lifetime of a memory location, it is either exclusively atomic or not. Holds a representation of T (alignment, etc.).
Lock-freedom is guaranteed only for std::atomic_flag.
Roughly: is there a compare-exchange for this size?
This is what I'm focusing on.
What does non-lock-free mean?
Why a runtime value? If it's not lock-free, you probably want to use a coarse-grained lock (unless you're lazy and want the implementation's locks).

struct Data { int spam, bacon, eggs; std::atomic<int> ready; };

void write(Data *d) {
  d->spam = 0xBEEF;
  d->bacon = 0xCAFE;
  d->eggs = 0x0000;
  d->ready.store(1, std::memory_order_release);
}

auto read(Data *d) {
  while (!d->ready.load(std::memory_order_acquire))
    ;
  return std::make_tuple(d->spam, d->bacon, d->eggs);
}

For illustration only, this isn't great code! Only certain variables are atomic, despite the entire struct being accessed by the two threads.
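Going back to the lock-free query above, a minimal sketch of how it can be consulted at compile time (C++17's is_always_lock_free) and at run time (the struct and names here are made up for illustration):

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; }; // likely too large for a lock-free compare-exchange

std::atomic<int> small_one;
std::atomic<Big> big_one;

int main() {
  // C++17 compile-time query; true for int on mainstream platforms.
  std::printf("always lock-free int: %d\n", (int)std::atomic<int>::is_always_lock_free);
  // Runtime query; a 64-byte payload usually falls back to the implementation's locks.
  std::printf("int: %d, Big: %d\n", (int)small_one.is_lock_free(), (int)big_one.is_lock_free());
}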
Why is the language set up this way?
Provide an abstraction for relevant hardware platforms.
Let's look at a few architectures. For more details see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html or your neighborhood compiler.
Oh, one of my favorite (NOT!) pieces of code in the kernel is the implementation of the smp_read_barrier_depends() macro, which on every single architecture except for one (alpha) is a no-op.
We have basically 30 or so empty definitions for it, and I think we have something like five uses of it. One of them, I think, is performance critical, and the reason for that macro existing.
What does it do? The semantics is that it's a read barrier between two different reads that we want to happen in order wrt two writes on the writing side (the writing side also has to have a smp_wmb() to order those writes). But the reason it isn't a simple read barrier is that the reads are actually causally *dependent*, ie we have code like
first_read = read_pointer;
smp_read_barrier_depends();
second_read = *first_read;
and it turns out that on pretty much all architectures (except for alpha), the *data dependency* will already guarantee that the CPU reads the thing in order. And because a read barrier can actually be quite expensive, we don't want to have a read barrier for this case.
But alpha? Its memory consistency is so broken that even the data dependency doesn't actually guarantee cache access order. It's strange, yes. No, it's not that alpha does some magic value prediction and can do the second read without having even done the first read first to get the address. What's actually going on is that the cache itself is unordered, and without the read barrier, you may get a stale version from the cache even if the writes were forced (by the write barrier in the writer) to happen in the right order.
— Linus Torvalds
ARM address dependency rule
LDR r1, [r0]
LDR r2, [r1]
LDR r1, [r0]
AND r1, r1, #0
LDR r2, [r3, r1]
Look ma, no barrier!
When the value returned by a read is used to compute the virtual address of a subsequent read or write, those two memory accesses are observed in program order. This is not a control dependency on ARM. See "Barrier Litmus Tests and Cookbook". See http://wg21.link/p0190; consume isn't implemented because dependencies aren't how compilers work.

template <typename T>
auto consumeLoad(const T* location) {
  using Returned = typename std::remove_cv<T>::type;
  struct Consumed { Returned value; size_t dependency; } ret{};
  ret.value = reinterpret_cast<const volatile std::atomic<Returned>*>(
      location)->load(std::memory_order_relaxed);
#if CPU(ARM)
  asm volatile("eor %[dependency], %[in], %[in]"
               : [dependency] "=r"(ret.dependency)
               : [in] "r"(ret.value)
               : "memory");
#else
  std::atomic_thread_fence(std::memory_order_acquire);
  ret.dependency = 0;
#endif
  return ret;
}
“looks like it probably works” — Richard Smith “(other than the aliasing violation)”
The actual code is more complicated.
Same architecture, different instructions / performance.
Flags to the rescue: -march, -mcpu, -mattr
more atomic: don’t violate forward progress
less atomic: don’t add non-benign races which weren’t already present
correct programs must work under all executions an implementation is allowed to create
Keep in mind: compilers optimize code that compiler developers see and think is worth optimizing. We’re not trying to trip you!
At least as much as the hardware could do
and maybe more 😉
Is this circular, since HW adapts to the standard? Can we optimize assuming there are no races?
We saw a few examples earlier.
What of simulations? QEMU. DBT (with varying optimization, sometimes entertaining bringup bugs such as unrolling a busy-wait loop). Virtual ISAs!

void inc(std::atomic<int> *y) { *y += 1; }
std::atomic<int> x;
void two() { inc(&x); inc(&x); }
std::atomic<int> x;
void two() { x += 2; }
Adds atomicity but cannot hinder forward progress: correct.
std::atomic<int> x;
void inc(int val) {
  x += 1;
  x += val;
}
_Z3inci:
  lock incl x(%rip)
  lock addl %edi, x(%rip)
Not going to win a Turing award with this: `inc r/m` versus `add r/m, r`; both with lock prefix; inc doesn’t set carry (sometimes partial flag stall); both 3 bytes; one less register; check out Agner for μops / latency, lock prefix makes this moot. Maybe the compiler would know better after all?
template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  x->compare_exchange_strong(expected, desired); // Inlined.
  return expected == desired;
}
template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  return x->compare_exchange_strong(expected, desired);
}
Generic code ends up with interesting cases such as these. bool is usually a flag… but returning it depends on ABI!
Works for any memory order but release and acq_rel
template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  T expected = desired;
  return x->compare_exchange_strong(expected, desired,
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed);
}
template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  return x->load(std::memory_order_seq_cst) == desired; // †
}
† compiler: mark transformed load as release sequence ‡
‡ as defined in section 1.10 of the C++ standard
template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  T z = x->load();
  x->store(y);
  return z;
}
template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  return x->exchange(y);
}
May not always pay off: architectures with weaker memory models may benefit from having write-after-read operations to the same location instead of having an atomic exchange.
Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
float y;
void g(float a) { y += a; }
addss y(%rip), %xmm0
movss %xmm0, y(%rip)
volatile float yv;
void gv(float a) { yv += a; }
addss yv(%rip), %xmm0
movss %xmm0, yv(%rip)
std::atomic<float> x;
void f(float a) {
  x.store(x.load(std::memory_order_relaxed) + a,
          std::memory_order_relaxed);
}
movl x(%rip), %eax
movd %eax, %xmm1
addss %xmm0, %xmm1
movd %xmm1, %eax
movl %eax, x(%rip)
addss x(%rip), %xmm0
movss %xmm0, x(%rip)
Optimization FTW!
Is this correct? All-relaxed doesn't guarantee visibility! Add an infrequent fence? 3× faster.
Inlining and constant propagation
atomic<T>::fetch_and(~(T)0) → atomic<T>::load()
Same for fetch_or(0) and fetch_xor(0)
atomic<T>::fetch_and(0) → atomic<T>::store(0)
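A sketch of the slide's claim in source form (function names made up for illustration): the identity read-modify-writes can be treated as plain loads, and fetch_and(0) with the result unused as a store.

#include <atomic>

std::atomic<unsigned> x;

unsigned identity_rmw() {
  // fetch_and with all-ones changes nothing: a compiler may treat it as x.load().
  return x.fetch_and(~0u);
}

unsigned plain_load() {
  return x.load(); // what the above can become (still seq_cst)
}

void clearing_rmw() {
  // Result unused, and old & 0 is always 0: memory-wise this is a store of 0.
  x.fetch_and(0);
}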
If simple things are hard…
…is your inline assembly correct?
…will it remain correct?
Do you trust your compiler?
Sorry to say, you’re already trusting your compiler (or deluding yourself).
What if the compiler emits suboptimal code?
Read-mostly; don’t starve writers.
std::tuple<T, T> reader() {
  T d1, d2;
  unsigned seq0, seq1;
  do {
    seq0 = seq.load(std::memory_order_acquire);
    d1 = data1.load(std::memory_order_relaxed);
    d2 = data2.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    seq1 = seq.load(std::memory_order_relaxed);
  } while (seq0 != seq1 || seq0 & 1);
  return {d1, d2};
}
void writer(T d1, T d2) {
  unsigned seq0 = seq.load(std::memory_order_relaxed);
  seq.store(seq0 + 1, std::memory_order_relaxed);
  data1.store(d1, std::memory_order_release);
  data2.store(d2, std::memory_order_release);
  seq.store(seq0 + 2, std::memory_order_release);
}
Relaxed accesses can be reordered with each other; the fence is heavy on some architectures; acquire over-constrains (prevents motion into the critical section); not intuitive.
We really want a release load
(there is no release load)
T d1, d2;
unsigned seq0, seq1;
do {
  seq0 = seq.load(std::memory_order_acquire);
  d1 = data1.load(std::memory_order_relaxed);
  d2 = data2.load(std::memory_order_relaxed);
  seq1 = seq.fetch_add(0, std::memory_order_release);
} while (seq0 != seq1 || seq0 & 1);
Better for some architectures; can move reads into critical section.
mov 0x200a76(%rip),%edx  # seq
mov 0x200a74(%rip),%eax  # data1
mov 0x200a72(%rip),%ecx  # data2
# acquire fence
mov 0x200a64(%rip),%esi  # seq
mov 0x2004c6(%rip),%ecx  # seq
mov 0x2004bc(%rip),%esi  # data1
xor %edx,%edx
mov 0x2004af(%rip),%r8d  # data2
lock xadd %edx,0x2004af(%rip)  # seq RdMW
mov 0x200a46(%rip),%edx  # seq
mov 0x200a44(%rip),%eax  # data1
mov 0x200a42(%rip),%ecx  # data2
mfence  # optimized RdMW!
mov 0x200a31(%rip),%esi  # seq
x86 `lock xadd` requires exclusive access! Can optimize to `mfence; mov`.
`mfence` is 3.5× slower. On POWER RdMW is faster but doesn't scale with number of threads.
Dead store elimination?
Strength reduction?
Careful: avoid doing so across synchronization points (another thread of execution can observe or modify memory, which means the traditional optimizations have to consider more intervening instructions). DSE: can't just prove an atomic store post-dominates and aliases another one to eliminate the other store.

std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
}
std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(2, std::memory_order_relaxed);
  y.fetch_add(2, std::memory_order_relaxed);
}

Not aware of compilers which do this.
atomic<int> progress(0);
void f() {
  for (int i = 0; i < 1000000; ++i) {
    // Heavy work, no shared memory...
    ++progress;
  }
}
atomic<int> progress(0);
void f() {
  int my_progress = 0; // register allocated
  for (int i = 0; i < 1000000; ++i) {
    // Heavy work, no shared memory...
    ++my_progress;
  }
  progress += my_progress;
}

When should compilers optimize atomics? We added non-normative wording to C++17 to discourage this.
Knows better
Backoff strategies are HW and OS specific. Atomics don't solve all problems.
Initial motivation: getChar() on PDP-11, reading the memory-mapped KBD_STAT I/O register.
K&R: The purpose of volatile is to force an implementation to suppress optimization that could otherwise occur
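A hedged sketch of that original use case (the address, register layout, and status bit here are illustrative, not the exact PDP-11 programming model):

#include <cstdint>

// Hypothetical memory-mapped keyboard status/data registers.
volatile std::uint16_t *const KBD_STAT = reinterpret_cast<volatile std::uint16_t *>(0177560);
volatile std::uint16_t *const KBD_DATA = reinterpret_cast<volatile std::uint16_t *>(0177562);

char getChar() {
  while ((*KBD_STAT & 0x80) == 0) {
    // Without volatile the compiler could hoist the load and spin on a stale
    // value forever; volatile forces a fresh read of the register every iteration.
  }
  return static_cast<char>(*KBD_DATA & 0xFF);
}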
-mserialize-volatile
-mnoserialize-volatile
When this option is enabled, GCC inserts MEMW instructions before volatile memory references to guarantee sequential consistency.
You can use the /volatile compiler switch to modify how the compiler interprets this keyword. If a struct member is marked as volatile, then volatile is propagated to the whole structure. If a structure does not have a length that can be copied on the current architecture by using one instruction, volatile may be completely lost on that structure.
clang emulates this with -fms-volatile
MSVC on ARM does something saner: store release, load acquire.
¯\_(ツ)_/¯
¯\_(ツ)_/¯
Lock-free usually means cmpxchg available. Touches bytes more than once!
No
what's that?
Both! Adds atomic as in Greek to volatile. Beware lock-free: what happens to address-free with shared memory?
By compilers I mean clang.
C's _Atomic ↔ C++'s std::atomic
__sync_* builtins
__atomic_* builtins
Compiler runtime has the lock shard. Older library versions were very broken.
Does little besides inlining constant memory_order_*.
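A rough sketch of the same operation spelled three ways on GCC/clang (std::atomic, the legacy __sync builtins, and the newer __atomic builtins that take an explicit order); C's _Atomic maps to the same operations. Variable names are made up for illustration.

#include <atomic>

std::atomic<int> cpp_counter;
int c_counter; // plain int for the builtin forms

int bump_all() {
  int a = cpp_counter.fetch_add(1, std::memory_order_seq_cst); // C++ std::atomic
  int b = __sync_fetch_and_add(&c_counter, 1);                 // legacy builtin, always seq_cst
  int c = __atomic_fetch_add(&c_counter, 1, __ATOMIC_RELAXED); // newer builtin, explicit order
  return a + b + c;
}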
isSimple, isVolatile, isAtomic: steer clear
similar to actual ISA instructions
Assume optimizations occur, encourage them.
Standardize common practice, enable to-the-metal optimizations.
More libraries: easy to use concurrency & parallelism; hard to get it wrong.
Drop assembly
File bugs
Talk to standards committee.
Use tooling: ThreadSanitizer.
Showcase your hardware’s strengths.
Get back to work!
and you should use atomics