Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic, such as std::atomic_uint64_t, in a thread-safe way without using extra mutexes etc. I also need to know which bit was cleared.
Example:
Say the currently stored value is 0b0110. I want to learn that the lowest non-zero bit is bit 1 (0-indexed), and I want the variable set to 0b0100.
The best version I've come up with is this:
#include <atomic>
#include <cstdint>

inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return v-1 & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, with_lowest_non_zero_cleared(expected),
                                     std::memory_order_seq_cst, std::memory_order_relaxed))
    {;}
    return only_keep_lowest_non_zero(expected);
}
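As a quick single-threaded sanity check of the 0b0110 example above (illustration only, assuming the functions from the snippet are in scope):

#include <atomic>
#include <cassert>
#include <cstdint>

int main()
{
    std::atomic_uint64_t x{0b0110};
    uint64_t bit = pop_lowest_non_zero(x);   // uses the functions defined above
    assert(bit == 0b0010);                   // bit 1 was the lowest set bit
    assert(x.load() == 0b0100);              // and it has been cleared
}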
Is there a better solution?
Notes:
It doesn't have to be the lowest non-zero bit - I'd also be happy with a solution that clears the highest set bit, or sets the highest/lowest zero bit, if that makes a difference.
I'd much prefer a standard C++(17) version, but would accept an answer that uses intrinsics for clang and MSVC.
It should be portable (compiling for x64 and AArch64 with clang or MSVC), but I'm mostly interested in performance on recent Intel and AMD x64 architectures when compiling with clang.
EDIT: The update has to be atomic and global progress must be guaranteed, but as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).
Peter Cordes · 7
There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr
is only available in memory-source form, not memory-destination form; lock blsr [shared_var]
does not exist. (Compilers know how to optimize (var-1) & (var)
into blsr
for local vars when you compile with -march=haswell
or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support (see footnote 1), it doesn't let you do anything fundamentally different.
The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. That's a better option than finding the right bit in an old value of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit because a lower bit got set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add
where there's no trying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)
The window for the CPU to lose the cache line between the load that gets expected and running lock cmpxchg is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.
You can instrument your cmpxchg retry loop to see how much contention you actually get in your real program (e.g. by incrementing a local inside the loop and doing an atomic relaxed += into a shared counter once you succeed, or by using thread-local counters).
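For example, a minimal sketch of that instrumentation (the counter and function names here are mine, for illustration only):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> g_cas_retries{0};          // hypothetical shared retry counter

uint64_t pop_lowest_non_zero_counted(std::atomic_uint64_t& foo)
{
    unsigned local_retries = 0;                  // cheap per-call counter
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, (expected-1) & expected,
                                     std::memory_order_seq_cst,
                                     std::memory_order_relaxed))
        ++local_retries;                         // count each failed attempt locally

    if (local_retries)                           // touch the shared counter only when we actually retried
        g_cas_retries.fetch_add(local_retries, std::memory_order_relaxed);
    return expected & -expected;                 // isolate the bit we cleared
}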
Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with solution above, it doesn't have to be a wait free algorithm (of course I'd be very happy if you can show me one).
A CAS retry loop (even when compiled on a LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.
CAS leaves the cache line unmodified if it fails. On x86 it does dirty the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so another thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock; that's something you can assume CPU architects thought about.
It's not wait-free, though, because a single thread can in theory have to retry an unbounded number of times.
Your code compiles with gcc8.1 -O3 -march=haswell into this asm (from the Godbolt compiler explorer):
# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                    # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                         # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                    # Bit Lowest Set Isolate
    ret
Without BMI1, blsr and blsi each become two instructions. Given the cost of the lock-ed instruction, that's close to irrelevant for performance.
clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see "How do the likely() and unlikely() macros in the Linux kernel work and what is their benefit?".)
Footnote 1:
Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation, even in the uncontended, hot-in-L1d-cache case, and even looking at the out-of-order execution throughput impact on surrounding code. (e.g. lock cmpxchg is 10 uops on Skylake (http://agner.org/optimize/), vs. saving 1 uop by using blsr instead of 2 other instructions. Assuming the front end is the bottleneck rather than memory or something else. Being a full memory barrier also has an impact on loads/stores in surrounding code, but fortunately not on the out-of-order execution of independent ALU instructions.)
Most non-x86 machines do all their atomic operations with load-linked/store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x
inside an LL/SC instead of just the add that you'd get from using fetch_add
), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)
So you just have to write code using compare_exchange_weak
if C++ doesn't provide the operation you want (e.g. fetch_or
or fetch_and
).
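For instance, a generic CAS-loop wrapper along those lines (the fetch_modify name and design are mine, not anything from the standard library):

#include <atomic>
#include <cstdint>

// Apply an arbitrary pure transformation to an atomic uint64_t, CAS-retry style.
// Returns the old value, matching the convention of fetch_add / fetch_or.
template <class F>
uint64_t fetch_modify(std::atomic_uint64_t& a, F op)
{
    auto expected = a.load(std::memory_order_relaxed);
    while(!a.compare_exchange_weak(expected, op(expected),
                                   std::memory_order_seq_cst,
                                   std::memory_order_relaxed))
    {}   // compare_exchange_weak reloads expected for us on failure
    return expected;
}

// e.g. the clear-lowest-set-bit operation from this answer:
//   uint64_t old = fetch_modify(foo, [](uint64_t v){ return (v-1) & v; });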
What you get in practice from current compilers is compare_exchange_weak implemented with LL/SC, with your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.
The LL/SC window is maybe shorter than it would be with 2 dependent ALU instructions between the load and the store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result (it depends on the previous load result). Clang's code-gen puts the sub / and inside the loop, but dependent on the previous iteration's load, so with out-of-order execution they can still be ready ahead of time.
But if that really were the most efficient way to do things, compilers should be using it for fetch_add too. They don't, because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know whether it could make a difference in very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.
I renamed your functions and reformatted the while() for readability, because it was already too long for one line without wrapping it in unlikely().
This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.
#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif

inline uint64_t clear_lowest_set(std::uint64_t v){
    return v-1 & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
    //return v & ~clear_lowest_set(v);
    return (-v) & v;
    // MSVC optimizes this better for ARM when used separately.
    // The other way (in terms of clear_lowest_set) does still CSE to 2 instructions
    // when the clear_lowest_set result is already available.
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(unlikely(!foo.compare_exchange_weak(
        expected, clear_lowest_set(expected),
        std::memory_order_seq_cst, std::memory_order_relaxed)))
        {}
    return isolate_lowest_set(expected);
}
Clang -O3
for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57
on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.
pop_lowest_non_zero(std::__1::atomic&):   // @pop_lowest_non_zero(std::__1::atomic &)
    ldr     x9, [x0]                @ the relaxed load
    ldaxr   x8, [x0]                @ the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .LBB0_3
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest( relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .LBB0_4            @ if(SC failed) goto retry loop
        # else fall through and eventually reach the epilogue
        # this looks insane.  w10 = 0|1, then branch if bit[0] != 0.  Always taken, right?
    orr     w10, wzr, #0x1
    tbnz    w10, #0, .LBB0_5        @ test-bit not-zero will always jump to the epilogue
    b       .LBB0_6                 # never reached
.LBB0_3:
    clrex                           @ clear the ldrx/stxr transaction
.LBB0_4:
    # This block is pointless; just should b to .LBB0_6
    mov     w10, wzr
    tbz     w10, #0, .LBB0_6        # go to the retry loop, below the ret (not shown here)
.LBB0_5:                            # isolate_lowest_set and return
    neg     x8, x9
    and     x0, x9, x8
    ret
.LBB0_6:
    @ the retry loop.  Only reached if the compare or SC failed.
    ...
All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173 (https://bugs.llvm.org/show_bug.cgi?id=38173).
It seems MSVC doesn't know that AArch64's release-stores are sequential-release (not just regular release like x86), because it's still using a dmb ish
instruction (full memory barrier) after stlxr
. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak
compiles like compare_exchange_strong
with a probably-useless retry loop that will try just the CAS again with the old expected/desired, on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new. That might explain the dmb
.)
## MSVC Pre 2018 -Ox
|pop_lowest_non_zero| PROC
    ldr     x10,[x0]                  # x10 = expected = foo.load(relaxed)
|$LL2@pop_lowest|           @ L2      # top of the while() loop
    sub     x8,x10,#1
    and     x11,x8,x10                # clear_lowest(relaxed load result)
|$LN76@pop_lowest|          @ LN76
    ldaxr   x8,[x0]
    cmp     x8,x10                    # the compare part of CAS
    bne     |$LN77@pop_lowest|
    stlxr   w9,x11,[x0]
    cbnz    w9,|$LN76@pop_lowest|     # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
    # fall through on LL/SC success
|$LN77@pop_lowest|          @ LN77
    dmb     ish                       # full memory barrier: slow
    cmp     x8,x10                    # compare again, because apparently MSVC wants to share the `dmb`
                                      # instruction between the CAS-fail and CAS-success paths.
    beq     |$LN75@pop_lowest|        # if(expected matches) goto epilogue
    mov     x10,x8                    # else update expected
    b       |$LL2@pop_lowest|         # and goto the top of the while loop
|$LN75@pop_lowest|          @ LN75    # function epilogue
    neg     x8,x10
    and     x0,x8,x10
    ret
gcc6.3 for AArch64 makes weird code, too, storing expected
to the stack. (Godbolt doesn't have newer AArch64 gcc).
I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.
pop_lowest_non_zero   ## hand written based on clang
                      # clang could emit this even without turning CAS into LL/SC directly
    ldr     x9, [x0]                @ x9 = expected = foo.load(relaxed)
.Lcas_retry:
    ldaxr   x8, [x0]                @ x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .Lcas_fail
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest (relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail
    # LL/SC success: isolate_lowest_set and return
.Lepilogue:
    neg     x8, x9
    and     x0, x9, x8
    ret
.Lcas_fail:
    clrex                           @ clear the ldrx/stxr transaction
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b       .Lcas_retry             @ instead of duplicating the loop, jump back to the existing one with x9 = expected
fetch_add: Clang does make nice code for fetch_add():
    mov     x8, x0
.LBB1_1:
    ldxr    x0, [x8]                # relaxed exclusive load: I used mo_release
    add     x9, x0, #1
    stlxr   w10, x9, [x8]
    cbnz    w10, .LBB1_1            # retry if LL/SC failed
    ret
Instead of add #1
, we'd like to have sub x9, x8, #1
/and x9, x9, x8
, so we just get a LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.
Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp = 1ULL << counter; tmp ^= tmp-1; i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 bit set at position 0..63) / blsmsk rax, rax (rax = all bits set up to that position). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.
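A minimal sketch of that counter-to-bitmap mapping, with the 65th (all-zero) state handled explicitly; the function name is mine and this is only an illustration of the idea:

#include <cstdint>

// Map a counter (number of set bits, 0..64) to a bitmap with that many high bits set.
// counter == 0 needs a special case: shifting a 64-bit value by 64 is undefined behaviour.
inline uint64_t counter_to_high_mask(unsigned counter)
{
    if (counter == 0) return 0;              // the 65th state: no bits set
    return ~0ULL << (64 - counter);          // the counter highest bits set
}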
Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << N) - 1 (for N < 64).
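For instance, a sketch using the BMI2 intrinsic (requires -mbmi2 or equivalent; the function name is mine, and `pattern` stands for whatever fixed bit layout your bitmap uses):

#include <immintrin.h>   // _pdep_u64
#include <cstdint>

// Scatter the n lowest contiguous bits into the set-bit positions of `pattern`.
inline uint64_t first_n_of_pattern(unsigned n, uint64_t pattern)
{
    uint64_t contiguous = (n >= 64) ? ~0ULL : (1ULL << n) - 1;  // n low bits set
    return _pdep_u64(contiguous, pattern);
}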
Or maybe it has to be a bitmask, but doesn't need to be one single bitmask?
No idea what your use-case is, but this kind of idea might be useful:
Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
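A rough sketch of that chunked layout (chunk count, names and the simple round-robin fallback are all made up here, purely to illustrate the preferred-chunk idea):

#include <atomic>
#include <cstdint>

struct ShardedBitmap {
    static constexpr int kChunks = 8;               // arbitrary for this sketch
    struct alignas(64) Chunk {                      // one cache line per chunk
        std::atomic_uint64_t bits{0};
    };
    Chunk chunks[kChunks];

    // Try the caller's preferred chunk first, then the others.
    // Returns the isolated bit (and which chunk it came from), or 0 if every
    // chunk looked empty at the moment we checked it.
    uint64_t pop_any(int preferred, int& chunk_out)
    {
        for (int i = 0; i < kChunks; ++i) {
            int c = (preferred + i) % kChunks;
            auto& a = chunks[c].bits;
            auto expected = a.load(std::memory_order_relaxed);
            while (expected != 0) {
                if (a.compare_exchange_weak(expected, (expected-1) & expected,
                                            std::memory_order_seq_cst,
                                            std::memory_order_relaxed)) {
                    chunk_out = c;
                    return expected & -expected;    // the bit we just cleared
                }
            }
        }
        return 0;
    }
};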
In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg
on it. That's not actually helpful for this case because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg
to succeed when other bytes of the qword have changed.