No Sane Compiler Would Optimize Atomics

JF Bastien <@jfbastien>


Compilers do optimize atomics, memory accesses around atomics, and utilize architecture-specific knowledge. My hobby is to encourage compilers to do more of this, programmers to rely on it, and hardware vendors to give us new atomic toys to optimize with. Oh, and standardize yet more close-to-the-metal concurrency and parallelism tools.

But, you say, surely volatile always means volatile, there’s nothing wrong with my benign races, nothing could even go wrong with non-temporal accesses, and who needs 6 memory orderings anyways‽ I’m glad you asked, let me tell you about my hobby…

Disclaimer

Peek under the hood. Prefer higher-level primitives (parallelism TS, mutex, etc).

When to go lock-free?

From Fedor Pikus' talk earlier this week "The speed of concurrency (is lock-free faster?)". Note: "does it work" has two "no" branches... That was inadvertent snark! Also see Hans Boehm's talk "Using weakly ordered C++ atomics correctly".

Quick Review

C++11 added

  • A memory model
  • Threads
  • Classes to communicate between threads

The C++ language itself had no notion of threads before C++11.

Sequential Consistency

Naïve model: memory accesses are simply interleaved

Hardware doesn't work that way

Overly restrictive

Limits compiler + HW reordering / optimizations; access granularity; want to reason about atomic code regions; fences cost 2 to 200 cycles depending on architecture

SC-DRF

Data race:

  • Two accesses to the same memory location by different threads are not ordered
  • At least one of them stores to the memory location
  • At least one of them is not a synchronization action

Data races are UB

You should use ThreadSanitizer

[intro.multithread]

1.10 Multi-threaded executions and data races

Atomic operations

Atomic: indivisible with respect to all other atomic accesses to that object

No tearing, no duplication, no elision.

Memory order

  • relaxed
  • consume
  • acquire
  • release
  • acq_rel
  • seq_cst
Used as constants passed to atomic operations. You can use a runtime variable, but your life will be sad. Actually a lattice, because cmpxchg takes two orderings (success and failure), though we may change that in the standard, see http://wg21.link/p0418 when published.
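For concreteness, a minimal sketch (names illustrative) of orderings being passed as constants, including the two orderings cmpxchg takes, which is where the lattice comes from:

#include <atomic>

std::atomic<int> flag{0};

void example() {
  flag.store(1, std::memory_order_release);          // one ordering per op
  int seen = flag.load(std::memory_order_acquire);
  int expected = 1;
  // cmpxchg takes two orderings: one for success, one for failure.
  flag.compare_exchange_strong(expected, 2,
                               std::memory_order_acq_rel,   // success
                               std::memory_order_acquire);  // failure
  (void)seen;
}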

relaxed

racing accesses

indivisible

enforce cache coherency

doesn't guarantee visibility

The location may be concurrently modified; refrain from optimizations that will break if it is. Cache coherence: single-variable ordering, implicit on most architectures, but requires ld.acq on Itanium.
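A sketch of a use where relaxed is arguably fine: a statistics counter where only the final total matters (thread and iteration counts are made up):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

int main() {
  std::vector<std::thread> pool;
  for (int t = 0; t < 4; ++t)
    pool.emplace_back([] {
      for (int i = 0; i < 100000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed); // indivisible, unordered
    });
  for (auto &th : pool) th.join();
  // All increments are visible here because join() synchronizes.
  std::printf("%ld\n", hits.load(std::memory_order_relaxed)); // 400000
}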

relaxed

“C++14 replaced the offensive wording with hand waving”

“it's probably incorrect if you care about the result”

— Hans Boehm

Paraphrasing Hans Boehm's talk yesterday.

Atomic

template<class T> struct atomic;

template<> struct atomic<integral>;

template<class T> struct atomic<T*>;

template<> struct atomic<floating-point>; // future

All atomic operations are on atomic objects. A memory location is exclusively atomic or not for its entire lifetime. Holds a representation of T (alignment, etc).

Atomic<T>

  • load
  • store
  • exchange
  • compare_exchange_{weak,strong}
  • is_lock_free
  • static constexpr is_always_lock_free
Must be trivially copyable; can be structs, but watch out for padding bits! All operations take a memory order argument (except is_lock_free).
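A sketch of how these compose: compare_exchange_weak may fail spuriously, so it belongs in a retry loop; here it builds an operation the API doesn't provide (fetch_mul is a made-up name):

#include <atomic>

// The canonical compare_exchange_weak retry loop.
int fetch_mul(std::atomic<int> &x, int by) {
  int old = x.load();
  while (!x.compare_exchange_weak(old, old * by)) {
    // On failure (including spurious failure), `old` was refreshed
    // with the current value; just retry.
  }
  return old;  // the value before the multiply, like fetch_add
}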

Atomic<integral>

  • fetch_add
  • fetch_sub
  • fetch_and
  • fetch_or
  • fetch_xor
Some shorthands (operators such as ++ and +=, which take no ordering) and coercions; atomic integers are two's complement (no UB on wraparound!).

Atomic<T*>

  • fetch_add
  • fetch_sub

Atomic<floating-point>

  • fetch_add
  • fetch_sub
  • More?
What can hardware do? FMA? Need cmpxchg? What of FP env (round, denorm, etc)? What's useful? See http://wg21.link/p0020

Fences

  • atomic_thread_fence
  • atomic_signal_fence
A signal fence is basically a compiler barrier. Fences exist to order atomic operations that don't themselves enforce ordering. On x86, non-seq_cst fences are just compiler fences.
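A minimal sketch of exactly that: a release/acquire fence pair ordering relaxed operations (variable names are illustrative):

#include <atomic>

std::atomic<int> payload{0}, ready{0};

void producer() {
  payload.store(42, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release); // orders the store above
  ready.store(1, std::memory_order_relaxed);
}

int consumer() {
  while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
  std::atomic_thread_fence(std::memory_order_acquire); // pairs with release fence
  return payload.load(std::memory_order_relaxed);      // guaranteed to see 42
}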

Lock-freedom

Guaranteed for std::atomic_flag

Roughly: is there a compare-exchange for this size?

This is what I'm focusing on.

What does non-lock-free mean?

Why a runtime value? If it's not lock-free, you probably want a coarse-grained lock instead (unless you're lazy and content with the implementation's locks).
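A sketch of both queries, assuming a C++17 compiler for is_always_lock_free (the Big type is a made-up example):

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };  // too big for a native compare-exchange

int main() {
  std::atomic<int> small{0};
  std::atomic<Big> big{};
  static_assert(std::atomic<int>::is_always_lock_free, "");  // per-type, C++17
  std::printf("int: %d, Big: %d\n",                          // per-object query
              (int)small.is_lock_free(), (int)big.is_lock_free());
}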

Simple usage

struct Data { int spam, bacon, eggs; std::atomic<int> ready; };
void write(Data *d) {
  d->spam = 0xBEEF;
  d->bacon = 0xCAFE;
  d->eggs = 0x0000;
  d->ready.store(1, std::memory_order_release);
}
auto read(Data *d) {
  while (!d->ready.load(std::memory_order_acquire)) ;
  return std::make_tuple(d->spam, d->bacon, d->eggs);
}
    
For illustration only, this isn't great code! Only certain variables are atomic, despite the entire struct being accessed by the two threads.

We have our tools!

Why is the language set up this way?

Provide an abstraction for relevant hardware platforms.

Let's look at a few architectures. For more details see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html or your neighborhood compiler.

x86-64

  • load relaxed: MOV
  • load consume: MOV
  • load acquire: MOV
  • load seq_cst: MOV
  • store relaxed: MOV
  • store release: MOV
  • store seq_cst: LOCK XCHG
  • non-seq_cst fence: # free!
  • seq_cst fence: MFENCE

Not mentioning RMW / cmpxchg. Why the fence? Loads may be reordered with older stores to different locations.

Power

  • load relaxed: ld
  • load consume: ld + dependencies
  • load acquire: ld; cmp; bc; isync
  • load seq_cst: hwsync; ld; cmp; bc; isync
  • store relaxed: st
  • store release: lwsync; st
  • store seq_cst: hwsync; st
  • cmpxchg relaxed: _loop: lwarx; cmp; bc _exit; stwcx.; bc _loop; _exit:
  • non-seq_cst fence: lwsync
  • seq_cst fence: hwsync

cmpxchg is ll / sc; more variants missing; RMW uses ll / sc. Not showing 64-bit.

ARMv7

  • load relaxed: ldr
  • load consume: ldr + dependencies
  • load acquire: ldr; dmb
  • load seq_cst: ldr; dmb
  • store relaxed: str
  • store release: dmb; str
  • store seq_cst: dmb; str; dmb
  • cmpxchg relaxed: _loop: ldrex; mov 0; teq; strexeq; teq 0; bne _loop
  • fences: dmb

Similar to Power; same for MIPS. 64-bit is slightly horrible: there's ldrexd, and sometimes ldrd is single-copy atomic. Many dmbs. "dmb sy" is system-wide (too strong for C++); "dmb ish" is often what you want. ARM has other barrier instructions as well!

ARMv8 A64

  • load relaxed: ldr
  • load consume: ldr + dependencies
  • load acquire: ldar
  • load seq_cst: ldar
  • store relaxed: str
  • store release: stlr
  • store seq_cst: stlr
  • cmpxchg relaxed: _loop: ldxr; cmp; bne _exit; stxr; cbnz _loop; _exit:
  • fences: dmb

Life imitates standards! Added "acquire" / "release" primitives, even to aarch32.

Itanium

  • load relaxed: ld.acq
  • load consume: ld.acq
  • load acquire: ld.acq
  • load seq_cst: ld.acq
  • store relaxed: st.rel
  • store release: st.rel
  • store seq_cst: st.rel; mf
  • cmpxchg acquire: cmpxchg.acq
  • non-seq_cst fence: # free!
  • seq_cst fence: mf

Alpha

Oh, one of my favorite (NOT!) pieces of code in the kernel is the implementation of the smp_read_barrier_depends() macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we have something like five uses of it. One of them, I think, is performance critical, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two different reads that we want to happen in order wrt two writes on the writing side (the writing side also has to have a smp_wmb() to order those writes). But the reason it isn't a simple read barrier is that the reads are actually causally *dependent*, ie we have code like

      first_read = read_pointer;
      smp_read_barrier_depends();
      second_read = *first_read;
    

and it turns out that on pretty much all architectures (except for alpha), the *data*dependency* will already guarantee that the CPU reads the thing in order. And because a read barrier can actually be quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data dependency doesn't actually guarantee cache access order. It's strange, yes. No, it's not that alpha does some magic value prediction and can do the second read without having even done the first read first to get the address. What's actually going on is that the cache itself is unordered, and without the read barrier, you may get a stale version from the cache even if the writes were forced (by the write barrier in the writer) to happen in the right order.

— Linus Torvalds

Hard to implement C++ efficiently on Alpha! Exception that proves the rule?

consume

ARM address dependency rule

LDR r1, [r0]
LDR r2, [r1]
    
LDR r1, [r0]
AND r1, r1, #0
LDR r2, [r3, r1]
    

Look ma, no barrier!

If the value returned by a read is used to compute the virtual address of a subsequent read or write, those two memory accesses are observed in program order. This doesn't hold for control dependencies on ARM. See "Barrier Litmus Tests and Cookbook". See http://wg21.link/p0190: consume isn't implemented because dependencies aren't how compilers work.

consume maybe?

template <typename T> auto consumeLoad(const T* location) {
    using Returned = typename std::remove_cv<T>::type;
    struct Consumed { Returned value; size_t dependency; } ret{};
    ret.value = reinterpret_cast<const volatile std::atomic<Returned>*>(
          location)->load(std::memory_order_relaxed);
#if CPU(ARM)
    asm volatile("eor %[dependency], %[in], %[in]"
                 : [dependency] "=r"(ret.dependency)
                 : [in] "r"(ret.value)
                 : "memory");
#else
    std::atomic_thread_fence(std::memory_order_acquire);
    ret.dependency = 0;
#endif
    return ret;
}
    

“looks like it probably works” — Richard Smith “(other than the aliasing violation)”

The actual code is more complicated.

Architecture-specific

Same architecture, different instructions / performance.

Flags to the rescue: -march, -mcpu, -mattr

  • Special instructions
  • What's the best spinloop?
  • Nominally incorrect code
  • Work around CPU bugs
Special: Intel RTM / HLE. Incorrect: on Apple's Swift processor, DMB ISH versus DMB ISHST (only waits for stores); some architectures don't need a barrier at all! AMD CPU bug for Chrome: barrier bug in AMD errata 147, Opterons 1xx/2xx/8xx, 2003-2004, hit 0.02% of Chrome users, worked around in the compiler.

Takeaways

  • C++ exposes hardware portably
  • Hardware changes over time
  • Assembly operations operate on memory locations
  • C++ atomics operate on data structures
Portability is hard. The compiler should know best. C++ is higher level than asm. Operating on data instead of code is a contentious point.

How do we optimize atomics?

as-if

more atomic: don’t violate forward progress

less atomic: don’t add non-benign races which weren’t already present

Put another way

correct programs must work under all executions an implementation is allowed to create

Keep in mind: compilers optimize code that compiler developers see and think is worth optimizing. We're not trying to trip you up!

What can the compiler do?

At least as much as the hardware could do

and maybe more 😉

Is this circular, since HW adapts to the standard? Can we optimize assuming there are no races?

What does hardware do?

We saw a few examples earlier

What of simulations? QEMU. DBT (with varying optimization, sometimes entertaining bringup bugs such as unrolling busy-wait loop). Virtual ISAs!

Simple example

void inc(std::atomic<int> *y) {
  *y += 1;
}
std::atomic<int> x;
void two() {
  inc(&x);
  inc(&x);
}
  
std::atomic<int> x;
void two() {
  x += 2;
}
  

Adds atomicity but cannot hinder forward progress: correct.

Similar example

std::atomic<int> x;
void inc(int val) {
  x += 1;
  x += val;
}
  
_Z3inci:
  lock incl x(%rip)
  lock addl %edi, x(%rip)
  

Not going to win a Turing award with this: `inc r/m` versus `add r/m, r`, both with a lock prefix; inc doesn't set carry (sometimes a partial flag stall); both are 3 bytes; one fewer register; check Agner Fog's tables for μops / latency, though the lock prefix makes it all moot. Maybe the compiler would know better after all?

Opportunities through inlining

template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  x->compare_exchange_strong(expected, desired); // Inlined.
  return expected == desired;
}
  
template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  return x->compare_exchange_strong(expected, desired);
}
  

Generic code ends up with interesting cases such as these. bool is usually a flag… but returning it depends on ABI!

A tricky one

Works for any memory order but release and acq_rel

template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  T expected = desired;
  return x->compare_exchange_strong(expected, desired,
    std::memory_order_seq_cst, std::memory_order_relaxed);
}
  
template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  return x->load(std::memory_order_seq_cst) == desired; // †
}
  

† compiler: mark transformed load as release sequence ‡

‡ as defined in section 1.10 of the C++ standard

Stronger, faster

template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  T z = x->load();
  x->store(y);
  return z;
}
  
template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  return x->exchange(y);
}
  

better?

May not always pay off: architectures with weaker memory models may benefit from having write-after-read operations to the same location instead of having an atomic exchange.

Interesting problem...

Read ; Modify ; Write

Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Racy

float y;
void g(float a) { y += a; }
    
        addss   y(%rip), %xmm0
        movss   %xmm0, y(%rip)
    

volatile

volatile float yv;
void gv(float a) { yv += a; }
    
        addss   yv(%rip), %xmm0
        movss   %xmm0, yv(%rip)
    

Correct?

std::atomic<float> x;
void f(float a) {
  x.store(x.load(std::memory_order_relaxed) + a,
          std::memory_order_relaxed);
}
    
        movl    x(%rip), %eax
        movd    %eax, %xmm1
        addss   %xmm0, %xmm1
        movd    %xmm1, %eax
        movl    %eax, x(%rip)
    
        addss   x(%rip), %xmm0
        movss   %xmm0, x(%rip)
    

Optimization FTW!

Is this correct? All-relaxed doesn't guarantee visibility! Add an infrequent fence? The optimized version is 3× faster.
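One possible shape of the "infrequent fence" idea (the period is made up, and whether relaxed accumulation is tolerable depends on the algorithm, as in Hogwild!):

#include <atomic>

std::atomic<float> x{0.f};

void accumulate(const float *a, int n) {
  for (int i = 0; i < n; ++i) {
    // Keep the hot loop relaxed: load, add, store.
    x.store(x.load(std::memory_order_relaxed) + a[i],
            std::memory_order_relaxed);
    if ((i & 0xFFF) == 0)  // made-up period: pay for ordering occasionally
      std::atomic_thread_fence(std::memory_order_seq_cst);
  }
}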

Moar!

Inlining and constant propagation

atomic<T>::fetch_and(~(T)0) → atomic<T>::load()

Same for fetch_or(0) and fetch_xor(0)

atomic<T>::fetch_and(0) → atomic<T>::store(0)
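A sketch of the shape that inlining and constant propagation expose (function names are made up):

#include <atomic>

std::atomic<unsigned> bits{0};

// After inlining with mask = ~0u, the fetch_and stores back exactly the
// value it read, so a compiler may treat it as bits.load().
unsigned get(unsigned mask) { return bits.fetch_and(mask); }
unsigned getAll() { return get(~0u); }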

Takeaway

If simple things are hard…

…is your inline assembly correct?

…will it remain correct?

Do you trust your compiler?

Sorry to say, you’re already trusting your compiler (or deluding yourself).

What if the compiler emits suboptimal code?

File a bug!

Sequence lock

  • Get ticket number
  • Get the data
  • Check the ticket again
    • If odd: write was happening
    • If different: write occurred
    • If same: no write occurred, data is good
  • If data isn’t good: try again

Read-mostly; don’t starve writers.

std::tuple<T, T> reader() {
  T d1, d2; unsigned seq0, seq1;
  do {
    seq0 = seq.load(std::memory_order_acquire);
    d1 = data1.load(std::memory_order_relaxed);
    d2 = data2.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    seq1 = seq.load(std::memory_order_relaxed);
  } while (seq0 != seq1 || seq0 & 1);
  return {d1, d2};
}
    
void writer(T d1, T d2) {
  unsigned seq0 = seq.load(std::memory_order_relaxed);
  seq.store(seq0 + 1, std::memory_order_relaxed);
  data1.store(d1, std::memory_order_release);
  data2.store(d2, std::memory_order_release);
  seq.store(seq0 + 2, std::memory_order_release);
}
    

Relaxed can be reordered with each other; fence is heavy on some architectures; acquire over-constrains (prevents motion into critical section); not intuitive.

We really want a release load

(there is no release load)

Read-don’t-modify-write

T d1, d2;
unsigned seq0, seq1;
do {
  seq0 = seq.load(std::memory_order_acquire);
  d1 = data1.load(std::memory_order_relaxed);
  d2 = data2.load(std::memory_order_relaxed);
  seq1 = seq.fetch_add(0, std::memory_order_release);
} while (seq0 != seq1 || seq0 & 1);
    

Better for some architectures; can move reads into critical section.

 mov    0x200a76(%rip),%edx        # seq
 mov    0x200a74(%rip),%eax        # data1
 mov    0x200a72(%rip),%ecx        # data2
# acquire fence
 mov    0x200a64(%rip),%esi        # seq
      
 mov    0x2004c6(%rip),%ecx        # seq
 mov    0x2004bc(%rip),%esi        # data1
 xor    %edx,%edx
 mov    0x2004af(%rip),%r8d        # data2
 lock xadd %edx,0x2004af(%rip)     # seq RdMW
      
 mov    0x200a46(%rip),%edx        # seq
 mov    0x200a44(%rip),%eax        # data1
 mov    0x200a42(%rip),%ecx        # data2
 mfence                            # optimized RdMW!
 mov    0x200a31(%rip),%esi        # seq
      

x86 `lock xadd` requires exclusive access! Can optimize to `mfence; mov`.

What's better?

`mfence` is 3.5× slower. On POWER RdMW is faster but doesn't scale with number of threads.

"Normal" compiler optimizations

Dead store elimination?

Strength reduction?

Careful: avoid doing so across synchronization points (another thread of execution can observe or modify memory, so the traditional optimizations have to consider more intervening instructions). DSE: you can't just prove that an atomic store post-dominates and aliases another one to eliminate the earlier store.
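A sketch of the distinction: adjacent relaxed stores to the same atomic may be merged, but the same pair separated by a synchronizing store may not (names illustrative):

#include <atomic>

std::atomic<int> x, ready;

void adjacent() {
  x.store(1, std::memory_order_relaxed); // dead: may be eliminated, since no
  x.store(2, std::memory_order_relaxed); // execution is required to observe 1
}

void across_sync() {
  x.store(1, std::memory_order_relaxed);
  ready.store(1, std::memory_order_release); // synchronization point: a thread
  x.store(2, std::memory_order_relaxed);     // that acquires ready == 1 may
                                             // legally observe x == 1, so the
                                             // first store is NOT dead
}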

relaxed optimization

std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
}
    
std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(2, std::memory_order_relaxed);
  y.fetch_add(2, std::memory_order_relaxed);
}
    
Not aware of compilers which do this.

Progress bar

std::atomic<int> progress(0);
void f() {
    for (int i = 0; i < 1000000; ++i) {
        // Heavy work, no shared memory...
        ++progress;
    }
}

std::atomic<int> progress(0);
void f() {
    int my_progress = 0; // register allocated
    for (int i = 0; i < 1000000; ++i) {
        // Heavy work, no shared memory...
        ++my_progress;
    }
    progress += my_progress;
}
    
When should compilers optimize atomics? We added non-normative wording to C++17 to discourage this.

Disabling reordering / fusing?

  • asm volatile("":::"memory"); // eek!
  • std::atomic_signal_fence(); // wat?
  • Use volatile? NO!
  • Don't use relaxed?
  • Synchronize-with?
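Of these, the signal fence is (roughly) a compiler-only barrier, so it can discourage reordering or fusing without emitting a hardware fence; a sketch:

#include <atomic>

std::atomic<int> counter{0};

void tick() {
  counter.fetch_add(1, std::memory_order_relaxed);
  // Compiler-only barrier: no hardware fence is emitted, but the compiler
  // may not reorder (or fuse) the two increments across it.
  std::atomic_signal_fence(std::memory_order_seq_cst);
  counter.fetch_add(1, std::memory_order_relaxed);
}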

Moar!

  • Tag "non-atomic" functions, optimize around them
  • Interference-free regions
  • Optimize fence positioning
  • Non-escaping thread local (TLS, stack)
  • etc.
Compilers already know about "read-only" functions.

std::mutex for sanity

  • easier to use correctly
  • in theory...
    • it could still get optimized
    • lock: acquire; unlock: release
  • pthread and kernel calls make this hard
  • std::synchronic to the rescue!

std::mutex

Knows better

Backoff strategies are HW and OS specific. Atomics don't solve all problems.

volatile

Initial motivation: getChar() on PDP-11, reading memory-mapped KBD_STAT I/O register

K&R: The purpose of volatile is to force an implementation to suppress optimization that could otherwise occur

What does that mean?

  • Prevent load / store motion?
  • Prevent arithmetic motion?
  • Hardware fence? x86 MFENCE? ARM DMB SY?
  • Shared memory?
When shared memory is used and the other side is untrusted / adversarial / POB (Plain Old Buggy); cf. the Xen exploit.

It gets better

Xtensa GCC options

-mserialize-volatile

-mnoserialize-volatile

When this option is enabled, GCC inserts MEMW instructions before volatile memory references to guarantee sequential consistency.

MSVC

You can use the /volatile compiler switch to modify how the compiler interprets this keyword. If a struct member is marked as volatile, then volatile is propagated to the whole structure. If a structure does not have a length that can be copied on the current architecture by using one instruction, volatile may be completely lost on that structure.

clang emulates this with -fms-volatile

MSVC on ARM does something saner: store release, load acquire.

What does it mean‽

¯\_(ツ)_/¯

  • can tear
  • each byte touched exactly once
  • no machine-level ordering
  • enforces abstract machine ordering with other volatile
  • can't duplicate
  • can't elide
Highly compiler + HW dependent. Not "atomic" as in Greek.

Can volatile be lock-free?

¯\_(ツ)_/¯

Lock-free usually means cmpxchg available. Touches bytes more than once!

Can volatile do RMW?

No

Are there portable volatile uses?

  • mark local variables in the same scope as a setjmp, preserve value across longjmp
  • externally modified: physical memory mapped in multiple virtual addresses
  • volatile sigatomic_t may be used to communicate with a signal handler in the same thread
This is pretty debatable! Note for signals: need a Plain Old Function, very arbitrary, will change in C++17.
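The signal-handler case, sketched (handler and loop are illustrative):

#include <csignal>
#include <cstdio>

volatile std::sig_atomic_t got_signal = 0;

extern "C" void on_int(int) { got_signal = 1; }  // same-thread only!

int main() {
  std::signal(SIGINT, on_int);
  while (!got_signal) { /* do work */ }
  std::puts("interrupted");
}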

volatile atomic

what's that?

Both! Adds atomic (as in Greek: indivisible) to volatile. Beware lock-free: what happens to address-free with shared memory?
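std::atomic's member functions are volatile-qualified, so the combination composes directly; a minimal sketch (doorbell is a made-up example):

#include <atomic>

// Each access really happens (volatile) and is indivisible / ordered (atomic).
volatile std::atomic<int> doorbell{0};

void ring() { doorbell.store(1, std::memory_order_release); }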

What do compilers do?

by compilers I mean clang

Standard library

C's _Atomic ↔ C++'s std::atomic

__sync_* builtins

__atomic_* builtins

The compiler runtime has the lock shard. Older library versions were very broken.

Front-end

does little besides inlining constant memory_order_*

Middle-end

  • load atomic [volatile] <ty>, <ty>* <pointer> [singlethread] <ordering>, align <alignment>
  • store atomic [volatile] <ty> <value>, <ty>* <pointer> [singlethread] <ordering>, align <alignment>

Middle-end

  • atomicrmw [volatile] <operation> <ty>* <pointer>, <ty> <value> [singlethread] <ordering>
  • cmpxchg [weak] [volatile] <ty>* <pointer>, <ty> <cmp>, <ty> <new> [singlethread] <success ordering> <failure ordering>
  • fence [singlethread] <ordering>

IR

isSimple, isVolatile, isAtomic: steer clear

Machine IR

similar to actual ISA instructions

Now what?

Standards Committee

Assume optimizations occur, encourage them.

Standardize common practice, enable to-the-metal optimizations.

More libraries: easy to use concurrency & parallelism; hard to get it wrong.

Developers

Drop assembly

File bugs

Talk to standards committee.

Use tooling: ThreadSanitizer.

Hardware vendors

Showcase your hardware’s strengths.

Compiler writers

Get back to work!

Sane Compilers Should Optimize Atomics

and you should use atomics

github.com/jfbastien/no-sane-compiler

@jfbastien