Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic, such as std::atomic_uint64_t, in a thread-safe way without using extra mutexes etc. I also need to know which bit was cleared.
Example:
Say the currently stored value is 0b0110. I want to learn that the lowest non-zero bit is bit 1 (0-indexed), and I want the variable set to 0b0100.
The best version I've come up with is this:
#include <atomic>
#include <cstdint>

inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return v-1 & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, with_lowest_non_zero_cleared(expected),
                                     std::memory_order_seq_cst, std::memory_order_relaxed))
    {;}
    return only_keep_lowest_non_zero(expected);
}
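As a quick single-threaded sanity check of the 0b0110 example above (illustration only, assuming the functions from the snippet are in scope):

#include <atomic>
#include <cassert>
#include <cstdint>

int main()
{
    std::atomic_uint64_t x{0b0110};
    uint64_t bit = pop_lowest_non_zero(x);   // uses the functions defined above
    assert(bit == 0b0010);                   // bit 1 was the lowest set bit
    assert(x.load() == 0b0100);              // and it has been cleared
}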
Is there a better solution?
Notes:
It doesn't have to be the lowest non-zero bit - I'd also be happy with a solution that clears the highest set bit, or sets the highest/lowest zero bit, if that makes a difference.
I'd much prefer a standard C++(17) version, but would accept an answer that uses intrinsics for clang and MSVC.
It should be portable (compiling for x64 and AArch64 with clang or MSVC), but I'm mostly interested in performance on recent Intel and AMD x64 architectures when compiling with clang.
EDIT: The update has to be atomic and global progress must be guaranteed, but as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).
Peter Cordes · 7
There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr
is only available in memory-source form, not memory-destination form; lock blsr [shared_var]
does not exist. (Compilers know how to optimize (var-1) & (var)
into blsr
for local vars when you compile with -march=haswell
or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support (see footnote 1), it doesn't let you do anything fundamentally different.
The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. That's a better option than finding the right bit in an old value of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit because a lower bit got set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add
where there's no trying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)
The window for the CPU to lose the cache line between the load that gets expected and running lock cmpxchg is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.
You can instrument your cmpxchg retry loop to see how much contention you actually get in your real program (e.g. by incrementing a local inside the loop and doing an atomic relaxed += into a shared counter once you succeed, or by using thread-local counters).
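For example, a minimal sketch of that instrumentation (the counter and function names here are mine, for illustration only):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> g_cas_retries{0};          // hypothetical shared retry counter

uint64_t pop_lowest_non_zero_counted(std::atomic_uint64_t& foo)
{
    unsigned local_retries = 0;                  // cheap per-call counter
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, (expected-1) & expected,
                                     std::memory_order_seq_cst,
                                     std::memory_order_relaxed))
        ++local_retries;                         // count each failed attempt locally

    if (local_retries)                           // touch the shared counter only when we actually retried
        g_cas_retries.fetch_add(local_retries, std::memory_order_relaxed);
    return expected & -expected;                 // isolate the bit we cleared
}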
Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
A CAS retry loop (even when compiled on a LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.
CAS leaves the cache line unmodified if it fails. On x86 it does dirty the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so another thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock; that's something you can assume CPU architects thought about.
It's not wait-free, though, because a single thread can in theory have to retry an unbounded number of times.
Your code compiles with gcc8.1 -O3 -march=haswell into this asm (from the Godbolt compiler explorer):
# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                    # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                         # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                    # Bit Lowest Set Isolate
    ret
Without BMI1, blsr and blsi each become two instructions. Given the cost of the lock-ed instruction, that's close to irrelevant for performance.
clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see "How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?".)
Footnote 1:
Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation, even in the uncontended, hot-in-L1d-cache case, and even looking at the out-of-order execution throughput impact on surrounding code. (e.g. lock cmpxchg is 10 uops on Skylake (http://agner.org/optimize/), vs. saving 1 uop by using blsr instead of 2 other instructions. Assuming the front end is the bottleneck rather than memory or something else. Being a full memory barrier also has an impact on loads/stores in surrounding code, but fortunately not on the out-of-order execution of independent ALU instructions.)
Most non-x86 machines do all their atomic operations with load-linked/store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x
inside an LL/SC instead of just the add that you'd get from using fetch_add
), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)
So you just have to write code using compare_exchange_weak
if C++ doesn't provide the operation you want (e.g. fetch_or
or fetch_and
).
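For instance, a generic CAS-loop wrapper along those lines (the fetch_modify name and design are mine, not anything from the standard library):

#include <atomic>
#include <cstdint>

// Apply an arbitrary pure transformation to an atomic uint64_t, CAS-retry style.
// Returns the old value, matching the convention of fetch_add / fetch_or.
template <class F>
uint64_t fetch_modify(std::atomic_uint64_t& a, F op)
{
    auto expected = a.load(std::memory_order_relaxed);
    while(!a.compare_exchange_weak(expected, op(expected),
                                   std::memory_order_seq_cst,
                                   std::memory_order_relaxed))
    {}   // compare_exchange_weak reloads expected for us on failure
    return expected;
}

// e.g. the clear-lowest-set-bit operation from this answer:
//   uint64_t old = fetch_modify(foo, [](uint64_t v){ return (v-1) & v; });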
What you get in practice from current compilers is compare_exchange_weak implemented with LL/SC, with your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.
The LL/SC window is maybe shorter than it would be with 2 dependent ALU instructions between the load and the store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result (it depends on the previous load result). Clang's code-gen puts the sub / and inside the loop, but dependent on the previous iteration's load, so with out-of-order execution they can still be ready ahead of time.
But if that really were the most efficient way to do things, compilers should be using it for fetch_add too. They don't, because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know whether it could make a difference in very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.
I renamed your functions and reformatted the while() for readability, because it was already too long for one line without wrapping it in unlikely().
This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.
#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif

inline uint64_t clear_lowest_set(std::uint64_t v){
    return v-1 & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
    //return v & ~clear_lowest_set(v);
    return (-v) & v;
    // MSVC optimizes this better for ARM when used separately.
    // The other way (in terms of clear_lowest_set) does still CSE to 2 instructions
    // when the clear_lowest_set result is already available.
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(unlikely(!foo.compare_exchange_weak(
        expected, clear_lowest_set(expected),
        std::memory_order_seq_cst, std::memory_order_relaxed)))
        {}
    return isolate_lowest_set(expected);
}
Clang -O3
for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57
on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.
pop_lowest_non_zero(std::__1::atomic&):   // @pop_lowest_non_zero(std::__1::atomic &)
    ldr     x9, [x0]                @ the relaxed load
    ldaxr   x8, [x0]                @ the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .LBB0_3
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest( relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .LBB0_4            @ if(SC failed) goto retry loop
        # else fall through and eventually reach the epilogue
        # this looks insane.  w10 = 0|1, then branch if bit[0] != 0.  Always taken, right?
    orr     w10, wzr, #0x1
    tbnz    w10, #0, .LBB0_5        @ test-bit not-zero will always jump to the epilogue
    b       .LBB0_6                 # never reached
.LBB0_3:
    clrex                           @ clear the ldrx/stxr transaction
.LBB0_4:
    # This block is pointless; just should b to .LBB0_6
    mov     w10, wzr
    tbz     w10, #0, .LBB0_6        # go to the retry loop, below the ret (not shown here)
.LBB0_5:                            # isolate_lowest_set and return
    neg     x8, x9
    and     x0, x9, x8
    ret
.LBB0_6:
    @ the retry loop.  Only reached if the compare or SC failed.
    ...
All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173 (https://bugs.llvm.org/show_bug.cgi?id=38173).
It seems MSVC doesn't know that AArch64's release-stores are sequential-release (not just regular release like x86), because it's still using a dmb ish
instruction (full memory barrier) after stlxr
. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak
compiles like compare_exchange_strong
with a probably-useless retry loop that will try just the CAS again with the old expected/desired, on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new. That might explain the dmb
.)
## MSVC Pre 2018 -Ox
|pop_lowest_non_zero| PROC
    ldr     x10,[x0]                  # x10 = expected = foo.load(relaxed)
|$LL2@pop_lowest|           @ L2      # top of the while() loop
    sub     x8,x10,#1
    and     x11,x8,x10                # clear_lowest(relaxed load result)
|$LN76@pop_lowest|          @ LN76
    ldaxr   x8,[x0]
    cmp     x8,x10                    # the compare part of CAS
    bne     |$LN77@pop_lowest|
    stlxr   w9,x11,[x0]
    cbnz    w9,|$LN76@pop_lowest|     # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
    # fall through on LL/SC success
|$LN77@pop_lowest|          @ LN77
    dmb     ish                       # full memory barrier: slow
    cmp     x8,x10                    # compare again, because apparently MSVC wants to share the `dmb`
                                      # instruction between the CAS-fail and CAS-success paths.
    beq     |$LN75@pop_lowest|        # if(expected matches) goto epilogue
    mov     x10,x8                    # else update expected
    b       |$LL2@pop_lowest|         # and goto the top of the while loop
|$LN75@pop_lowest|          @ LN75    # function epilogue
    neg     x8,x10
    and     x0,x8,x10
    ret
gcc6.3 for AArch64 makes weird code, too, storing expected
to the stack. (Godbolt doesn't have newer AArch64 gcc).
I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.
pop_lowest_non_zero   ## hand written based on clang
                      # clang could emit this even without turning CAS into LL/SC directly
    ldr     x9, [x0]                @ x9 = expected = foo.load(relaxed)
.Lcas_retry:
    ldaxr   x8, [x0]                @ x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .Lcas_fail
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest (relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail
    # LL/SC success: isolate_lowest_set and return
.Lepilogue:
    neg     x8, x9
    and     x0, x9, x8
    ret
.Lcas_fail:
    clrex                           @ clear the ldrx/stxr transaction
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b       .Lcas_retry             @ instead of duplicating the loop, jump back to the existing one with x9 = expected
fetch_add: Clang does make nice code for fetch_add():
    mov     x8, x0
.LBB1_1:
    ldxr    x0, [x8]                # relaxed exclusive load: I used mo_release
    add     x9, x0, #1
    stlxr   w10, x9, [x8]
    cbnz    w10, .LBB1_1            # retry if LL/SC failed
    ret
Instead of add #1
, we'd like to have sub x9, x8, #1
/and x9, x9, x8
, so we just get a LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.
Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp = 1ULL << counter; tmp ^= tmp-1; i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 bit set at position 0..63) / blsmsk rax, rax (rax = all bits set up to that position). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.
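A minimal sketch of that counter-to-bitmap mapping, with the 65th (all-zero) state handled explicitly; the function name is mine and this is only an illustration of the idea:

#include <cstdint>

// Map a counter (number of set bits, 0..64) to a bitmap with that many high bits set.
// counter == 0 needs a special case: shifting a 64-bit value by 64 is undefined behaviour.
inline uint64_t counter_to_high_mask(unsigned counter)
{
    if (counter == 0) return 0;              // the 65th state: no bits set
    return ~0ULL << (64 - counter);          // the counter highest bits set
}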
Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << N) - 1 (for N < 64).
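For instance, a sketch using the BMI2 intrinsic (requires -mbmi2 or equivalent; the function name is mine, and `pattern` stands for whatever fixed bit layout your bitmap uses):

#include <immintrin.h>   // _pdep_u64
#include <cstdint>

// Scatter the n lowest contiguous bits into the set-bit positions of `pattern`.
inline uint64_t first_n_of_pattern(unsigned n, uint64_t pattern)
{
    uint64_t contiguous = (n >= 64) ? ~0ULL : (1ULL << n) - 1;  // n low bits set
    return _pdep_u64(contiguous, pattern);
}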
Or maybe it has to be a bitmask, but doesn't need to be one single bitmask?
No idea what your use-case is, but this kind of idea might be useful:
Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
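A rough sketch of that chunked layout (chunk count, names and the simple round-robin fallback are all made up here, purely to illustrate the preferred-chunk idea):

#include <atomic>
#include <cstdint>

struct ShardedBitmap {
    static constexpr int kChunks = 8;               // arbitrary for this sketch
    struct alignas(64) Chunk {                      // one cache line per chunk
        std::atomic_uint64_t bits{0};
    };
    Chunk chunks[kChunks];

    // Try the caller's preferred chunk first, then the others.
    // Returns the isolated bit (and which chunk it came from), or 0 if every
    // chunk looked empty at the moment we checked it.
    uint64_t pop_any(int preferred, int& chunk_out)
    {
        for (int i = 0; i < kChunks; ++i) {
            int c = (preferred + i) % kChunks;
            auto& a = chunks[c].bits;
            auto expected = a.load(std::memory_order_relaxed);
            while (expected != 0) {
                if (a.compare_exchange_weak(expected, (expected-1) & expected,
                                            std::memory_order_seq_cst,
                                            std::memory_order_relaxed)) {
                    chunk_out = c;
                    return expected & -expected;    // the bit we just cleared
                }
            }
        }
        return 0;
    }
};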
In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg
on it. That's not actually helpful for this case because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg
to succeed when other bytes of the qword have changed.