• Synchronization

    From MitchAlsup@[email protected] to comp.arch on Wed May 27 21:44:39 2026
    From Newsgroup: comp.arch


    This is mainly for Chris M. Thomasson but everyone is free to pile on::

    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties
    see that the unit of work was either never started or finished completely without any intermediate state being observed by any party other than the
    core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} fall under this definition.

    define: memory-atomic as a unit of work identified in a core and packaged
    up to be performed by the cache hierarchy or PCIe device control registers. Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- of-work are performed in the cache having write-permission {EXCLUSIVE or MODIFIED cache line states}. When unCacheable, these units of work are performed directly on the addressed control register. In both cases, any interested party sees only the previous state or the after state.

    Core-atomic events have the cache hierarchy performing a number (often quadratic) cache line transfers to satisfy a single event ({failing
    (n-1) others when under contention}.

    Memory-atomic events cannot escape the interconnect latency overhead
    but move no data around that does not have to be moved.

    Question: If you were starting lout with a clean slate <as it were>,
    and not knowing if the atomic-shared-library is to be used in places
    where core count was unknown AND interference level was unknown::

    Would you implement the atomic-shared-library using core-atomic inst-
    ructions or would you implement the library using memory-atomic inst-
    ructions ?? Or would you implement some of the library as core-atomic
    and other parts as memory-atomic?? and why!

    Auxiliary question: Given the definition <above> of memory-atomic::
    are not all to-memory packaged events not ATOMIC in the Lamport sense
    as long as the package leaves the core sequentially consistent ??

    Thanks
    Mitch
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu May 28 12:37:14 2026
    From Newsgroup: comp.arch

    On 5/27/2026 2:44 PM, MitchAlsup wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on::

    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties
    see that the unit of work was either never started or finished completely without any intermediate state being observed by any party other than the core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} fall under this definition.

    define: memory-atomic as a unit of work identified in a core and packaged
    up to be performed by the cache hierarchy or PCIe device control registers. Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- of-work are performed in the cache having write-permission {EXCLUSIVE or MODIFIED cache line states}. When unCacheable, these units of work are performed directly on the addressed control register. In both cases, any interested party sees only the previous state or the after state.

    Core-atomic events have the cache hierarchy performing a number (often quadratic) cache line transfers to satisfy a single event ({failing
    (n-1) others when under contention}.

    Memory-atomic events cannot escape the interconnect latency overhead
    but move no data around that does not have to be moved.

    Question: If you were starting lout with a clean slate <as it were>,
    and not knowing if the atomic-shared-library is to be used in places
    where core count was unknown AND interference level was unknown::

    Would you implement the atomic-shared-library using core-atomic inst- ructions or would you implement the library using memory-atomic inst- ructions ?? Or would you implement some of the library as core-atomic
    and other parts as memory-atomic?? and why!

    Auxiliary question: Given the definition <above> of memory-atomic::
    are not all to-memory packaged events not ATOMIC in the Lamport sense
    as long as the package leaves the core sequentially consistent ??

    Both, but the choice is almost "secondary" in a sense as to whether the programmer has done their job first... Also, well, I am quite fond of
    relaxed systems. Shit. Perhaps not at relaxed as an alpha, wrt
    data-dependent loads.

    Before you even pick the instruction class, the programmer must
    eliminate false sharing. If the target is an LL/SC architecture, they
    really _need_ to know the reservation granule size and pad/align
    accordingly. For RMW like LOCK XADD, never straddle a l2 cache line
    boundary — pad and align to it. These aren't optional courtesies,
    they're prerequisites...?

    Actually, an arch that has both LL/SC and atomic RMW, well, that's
    pretty nice. I think AMD has that? Give the programmer the ability to
    choose.

    Beyond that, the deeper answer is: don't hammer a single location at
    all! damn it. A single lock-free stack under contention is a problem regardless of whether you're using core-atomic or memory-atomic
    instructions. A hash table of lock/wait-free stacks/queues distributes
    that pressure. There are single-producer/single consumer (SPSC), multi-producer-single consumer (MPSC), all the way to (MPMC)... Each
    algo is optimized for its usage pattern. Design for clever distribution
    first, then pick your atomics. Actually if we map the per threads to the
    hyper threads, cores, cpu's and numa nodes in a fractal way, then we can
    scale fairly well... Gain the topology and localize it in a sense...

    For something like a shared counter, I'd go thread-local first,
    aggregated per-CPU, then per-NUMA node, etc. Keep sharing as local as
    possible — affinity masks in user space to map the topology, and if you
    must share outside per-thread variables, share local, as local as
    possible...

    So for the clean-slate question: memory-atomic where available and
    applicable, because it avoids the quadratic cache line transfer problem
    you describe under contention. But core-atomic where you need the
    flexibility, with the programmer taking responsibility for alignment and distribution. That can be an issue... The programmers MUST know what
    they are doing! ;^o

    On the auxiliary question: yes, I believe memory-atomic events are
    atomic in the Lamport sense?, provided the package leaves the core sequentially consistent — any observer sees only the before or after
    state, which is exactly what Lamport requires.

    Then there is the SPARC in RMO mode. I have a lot of experience with
    that sucker back in the say when Sun gave me a T2000 for the CoolThreads contest and my vzoom project. All the atomic RMW's are relaxed and it
    uses its fancy MEMBAR instruction to gain the exact membars that are
    needed. For instance, try to avoid #StoreLoad at all costs. Only use
    them when needed. Then there is the async, or remote membars ala async lock/wait-free algos. Are you familiar with FlushProcessWriteBuffers on Windows? Afaict, its Microsoft's way to gain exotic algos, like some
    variants of RCU?

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers


    Actually, before any of that, try to gain a embarrassingly parallel
    version of your system and only share when a gun is to your head... ;^)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun May 31 20:57:25 2026
    From Newsgroup: comp.arch

    On 5/28/26 3:37 PM, Chris M. Thomasson wrote:
    On 5/27/2026 2:44 PM, MitchAlsup wrote:
    [snip atomic op context]
    Question: If you were starting lout with a clean slate <as it
    were>,
    and not knowing if the atomic-shared-library is to be used in
    places
    where core count was unknown AND interference level was unknown::

    Would you implement the atomic-shared-library using
    core-atomic inst-
    ructions or would you implement the library using
    memory-atomic inst-
    ructions ?? Or would you implement some of the library as
    core-atomic
    and other parts as memory-atomic?? and why!

    Auxiliary question: Given the definition <above> of
    memory-atomic::
    are not all to-memory packaged events not ATOMIC in the
    Lamport sense
    as long as the package leaves the core sequentially consistent ??

    Both, but the choice is almost "secondary" in a sense as to
    whether the programmer has done their job first... Also, well, I
    am quite fond of relaxed systems. Shit. Perhaps not at relaxed
    as an alpha, wrt data-dependent loads.

    Before you even pick the instruction class, the programmer must
    eliminate false sharing. If the target is an LL/SC architecture,
    they really _need_ to know the reservation granule size and
    pad/align accordingly. For RMW like LOCK XADD, never straddle a
    l2 cache line boundary — pad and align to it. These aren't
    optional courtesies, they're prerequisites...?

    If the operation is performed "centrally" (owning cache, I/O
    device), then cache block size may not be important: the cache
    can avoid performance problems associated with false sharing
    (the singular cache/device acts as a single agent performing
    all the operations).

    While it would not add significant latency (or significantly
    reduce throughput) — I think —, it _seems_ that it would also be
    possible for different levels of cache to have different
    granularities with respect to exported atomic operations. E.g.,
    a cluster of cores sharing L2 cache might use a different
    granule for exported atomic operations than L3 (shared by
    multiple clusters). A smaller L3 granule would seem to allow for
    high contention among many distributed threads with the
    assumption that contention within a cluster is necessarily
    limited; this might be more a concern when L3 slices are
    address-distributed rather than having multiple L3s.

    SRAM array width seems a natural granule for optimization as
    that is the physical read and write size (typically).

    On the other hand, there seem to be other local cache conflict
    factors (like bank conflicts) unrelated to "reservation
    granule" that such is really probably not a consideration for
    software optimization.

    One might also have cases where an exported (word granule)
    atomic operation is in conflict with a thread-performed (cache
    block granule) atomic operation. If the cache block is owned
    outside of the thread locally performing an atomic operation,
    the exported atomic might be granted priority if the ownership
    transfer was still queued, but there might be cache block
    ping-pong with thread-performing atomics repeatedly taking
    ownership. This implies that consistency with the type of
    atomic operation could be important.

    For addition, the operations could be coalesced as they pass
    toward the owner in the network (and "split" as the result is
    returned to individual agents/threads). Such might only make
    sense for fire-and-forget operations (like software performance
    counters or increasing reference counts) if the centralized
    buffering is sufficient. Distributing the operations might
    reduce network bandwidth by splitting a message into multiple
    messages closer to their return destinations rather than having
    multiple messages sent back from the central source, but having
    multiple operations on the same data within the queuing delay
    seems an uncommon case.

    Decreasing reference counts are interesting in that the
    interesting case is when the count becomes zero and a resource
    freeing operation may be performed. Generally(?) this side
    effect would not be synchronous with the thread changing the
    reference count, so there could be some sense in allowing such
    special cases to be caught outside the thread like a I/O event.
    Of course, if resources were being released because they were
    scarce and wanted by the releasing thread, then there might be
    a latency penalty associated with sending the freeing work to
    another hardware context.

    Even with ordinary stores that might be read by other threads
    (or later in a threads execution under out-of-order execution),
    there seems to be some diversity of temporal distance in access
    which might be exploited.

    There might also be "just-a-peek" memory reads that are used for
    general decision making (like reading the consumer pointer of a
    buffer to measure roughly how full the buffer is and whether
    work priorities should be changed). While it might not be
    practical to allow these to have inconsistent ordering (in terms
    of software and hardware complexity), there may be optimization
    opportunities in exploiting such memory accesses. (Such might
    also be useful for value prediction if coherence/ordering
    correctness introduces excessive delay, getting a speculative
    value earlier may sometimes be useful.)

    I remember a paper suggesting assigning limited lifetimes to
    cache block ownership to reduce coherence traffic; once the loan
    period ended, the block would no longer be held. A "peek" might
    be considered an infinitesimal length loan.


    Actually, an arch that has both LL/SC and atomic RMW, well,
    that's pretty nice. I think AMD has that? Give the programmer
    the ability to choose.

    Beyond that, the deeper answer is: don't hammer a single
    location at all! damn it. A single lock-free stack under
    contention is a problem regardless of whether you're using
    core-atomic or memory-atomic instructions.

    I would guess that there are software development tradeoffs in
    how much effort is made to avoid conflicts (and promote
    scalability). Sometimes having a global lock may make sense.

    A hash table of
    lock/wait-free stacks/queues distributes that pressure. There
    are single-producer/single consumer (SPSC),
    multi-producer-single consumer (MPSC), all the way to (MPMC)...
    Each algo is optimized for its usage pattern. Design for clever
    distribution first, then pick your atomics. Actually if we map
    the per threads to the hyper threads, cores, cpu's and numa
    nodes in a fractal way, then we can scale fairly well... Gain
    the topology and localize it in a sense...

    For something like a shared counter, I'd go thread-local first,
    aggregated per-CPU, then per-NUMA node, etc. Keep sharing as
    local as possible — affinity masks in user space to map the
    topology, and if you must share outside per-thread variables,
    share local, as local as possible...

    Yet the "topology" changes if the atomic operations are
    exported. If all shared counter operations are exported for a
    particular address and the hardware uses aggressive
    optimizations, then sharing a counter may be sufficiently
    scalable with benefits in software simplicity (move complexity
    to hardware☺) or other areas.

    [snip remaining interesting content]
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Mon Jun 1 15:32:28 2026
    From Newsgroup: comp.arch

    On 5/31/2026 5:57 PM, Paul Clayton wrote:
    On 5/28/26 3:37 PM, Chris M. Thomasson wrote:
    On 5/27/2026 2:44 PM, MitchAlsup wrote:
    [snip atomic op context]
    Question: If you were starting lout with a clean slate <as it were>,
    and not knowing if the atomic-shared-library is to be used in places
    where core count was unknown AND interference level was unknown::

    Would you implement the atomic-shared-library using core-atomic inst-
    ructions or would you implement the library using memory-atomic inst-
    ructions ?? Or would you implement some of the library as core-atomic
    and other parts as memory-atomic?? and why!

    Auxiliary question: Given the definition <above> of memory-atomic::
    are not all to-memory packaged events not ATOMIC in the Lamport sense
    as long as the package leaves the core sequentially consistent ??

    Both, but the choice is almost "secondary" in a sense as to whether
    the programmer has done their job first... Also, well, I am quite fond
    of relaxed systems. Shit. Perhaps not at relaxed as an alpha, wrt
    data-dependent loads.

    Before you even pick the instruction class, the programmer must
    eliminate false sharing. If the target is an LL/SC architecture, they
    really _need_ to know the reservation granule size and pad/align
    accordingly. For RMW like LOCK XADD, never straddle a l2 cache line
    boundary — pad and align to it. These aren't optional courtesies,
    they're prerequisites...?

    If the operation is performed "centrally" (owning cache, I/O
    device), then cache block size may not be important: the cache
    can avoid performance problems associated with false sharing
    (the singular cache/device acts as a single agent performing
    all the operations).

    While it would not add significant latency (or significantly
    reduce throughput) — I think —, it _seems_ that it would also be
    possible for different levels of cache to have different
    granularities with respect to exported atomic operations. E.g.,
    a cluster of cores sharing L2 cache might use a different
    granule for exported atomic operations than L3 (shared by
    multiple clusters). A smaller L3 granule would seem to allow for
    high contention among many distributed threads with the
    assumption that contention within a cluster is necessarily
    limited; this might be more a concern when L3 slices are
    address-distributed rather than having multiple L3s.

    SRAM array width seems a natural granule for optimization as
    that is the physical read and write size (typically).

    On the other hand, there seem to be other local cache conflict
    factors (like bank conflicts) unrelated to "reservation
    granule" that such is really probably not a consideration for
    software optimization.

    One might also have cases where an exported (word granule)
    atomic operation is in conflict with a thread-performed (cache
    block granule) atomic operation. If the cache block is owned
    outside of the thread locally performing an atomic operation,
    the exported atomic might be granted priority if the ownership
    transfer was still queued, but there might be cache block
    ping-pong with thread-performing atomics repeatedly taking
    ownership. This implies that consistency with the type of
    atomic operation could be important.

    For addition, the operations could be coalesced as they pass
    toward the owner in the network (and "split" as the result is
    returned to individual agents/threads). Such might only make
    sense for fire-and-forget operations (like software performance
    counters or increasing reference counts) if the centralized
    buffering is sufficient. Distributing the operations might
    reduce network bandwidth by splitting a message into multiple
    messages closer to their return destinations rather than having
    multiple messages sent back from the central source, but having
    multiple operations on the same data within the queuing delay
    seems an uncommon case.

    Decreasing reference counts are interesting in that the
    interesting case is when the count becomes zero and a resource
    freeing operation may be performed. Generally(?) this side
    effect would not be synchronous with the thread changing the
    reference count, so there could be some sense in allowing such
    special cases to be caught outside the thread like a I/O event.
    Of course, if resources were being released because they were
    scarce and wanted by the releasing thread, then there might be
    a latency penalty associated with sending the freeing work to
    another hardware context.

    Even with ordinary stores that might be read by other threads
    (or later in a threads execution under out-of-order execution),
    there seems to be some diversity of temporal distance in access
    which might be exploited.

    There might also be "just-a-peek" memory reads that are used for
    general decision making (like reading the consumer pointer of a
    buffer to measure roughly how full the buffer is and whether
    work priorities should be changed). While it might not be
    practical to allow these to have inconsistent ordering (in terms
    of software and hardware complexity), there may be optimization
    opportunities in exploiting such memory accesses. (Such might
    also be useful for value prediction if coherence/ordering
    correctness introduces excessive delay, getting a speculative
    value earlier may sometimes be useful.)

    I remember a paper suggesting assigning limited lifetimes to
    cache block ownership to reduce coherence traffic; once the loan
    period ended, the block would no longer be held. A "peek" might
    be considered an infinitesimal length loan.


    Actually, an arch that has both LL/SC and atomic RMW, well, that's
    pretty nice. I think AMD has that? Give the programmer the ability to
    choose.

    Beyond that, the deeper answer is: don't hammer a single location at
    all! damn it. A single lock-free stack under contention is a problem
    regardless of whether you're using core-atomic or memory-atomic
    instructions.

    I would guess that there are software development tradeoffs in
    how much effort is made to avoid conflicts (and promote
    scalability). Sometimes having a global lock may make sense.

    A hash table of lock/wait-free stacks/queues distributes that
    pressure. There are single-producer/single consumer (SPSC), multi-
    producer-single consumer (MPSC), all the way to (MPMC)... Each algo is
    optimized for its usage pattern. Design for clever distribution first,
    then pick your atomics. Actually if we map the per threads to the
    hyper threads, cores, cpu's and numa nodes in a fractal way, then we
    can scale fairly well... Gain the topology and localize it in a sense...

    For something like a shared counter, I'd go thread-local first,
    aggregated per-CPU, then per-NUMA node, etc. Keep sharing as local as
    possible — affinity masks in user space to map the topology, and if
    you must share outside per-thread variables, share local, as local as
    possible...

    Yet the "topology" changes if the atomic operations are
    exported. If all shared counter operations are exported for a
    particular address and the hardware uses aggressive
    optimizations, then sharing a counter may be sufficiently
    scalable with benefits in software simplicity (move complexity
    to hardware☺) or other areas.

    [snip remaining interesting content]


    The topology needs to be available to the programmer, full stop in a
    sense... I get the appeal of "let the hardware handle it" but that's not
    good enough? If I can't see the topology, I can't make "informed
    decisions" about "data placement", affinity, and locality... Hardware optimizations are a nice bonus on top of good software design. Shit.
    Howerver, they're not a substitute for it? And if I can't reason about
    the shape of the system, I can't even make an informed decision about
    when a global lock is acceptable versus when I need a distributed
    structure. That tradeoff requires knowledge, not abstraction.

    Also, things do NOT need to be all seq_cst. Shit. My SPARC/RMO
    experience is directly relevant here. All the atomic RMWs are relaxed,
    and you use MEMBAR to express exactly the ordering you need and nothing
    more. Avoid #StoreLoad at all costs — only pay for it when the algorithm genuinely demands it. That discipline forces you to think carefully
    about what guarantees you actually need between specific threads, which
    is a feature not a bug.

    And here's where the two points connect: if you understand the topology,
    you know which threads share which data, which tells you what ordering guarantees you actually need between them. Abstract away the topology
    and you're pushed toward conservative, expensive, blanket seq_cst
    everywhere because you can't reason about anything more precise. Expose
    the topology and you can be surgical about it.

    Relaxed where you can. Stronger ordering only where you must. And always knowing the shape of the system you're running on. ;^)

    Also, for x86 yes we need to pad and align for the l2 cache line anyway.
    So shit happens?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:18:36 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 5/31/2026 5:57 PM, Paul Clayton wrote:
    On 5/28/26 3:37 PM, Chris M. Thomasson wrote:
    On 5/27/2026 2:44 PM, MitchAlsup wrote:
    ----------------
    [snip remaining interesting content]


    The topology needs to be available to the programmer, full stop in a sense... I get the appeal of "let the hardware handle it" but that's not good enough? If I can't see the topology, I can't make "informed
    decisions" about "data placement", affinity, and locality...

    So, what kind of topology is an appropriate description of a system
    as small as 1 core, 1 main bus, and 1 PCIe tree or as large as 1M
    cores, 1024 independent interconnects, and <essentially> unbounded
    number of PCIe trees (such as a basketball stadium sized server) ???

    Hardware optimizations are a nice bonus on top of good software design. Shit. Howerver, they're not a substitute for it? And if I can't reason about
    the shape of the system, I can't even make an informed decision about
    when a global lock is acceptable versus when I need a distributed
    structure.

    What if there is no concept of a global lock--except one created by
    SW's use of the topology in order to bring semblance of order into
    which there is essentially none ??

    That tradeoff requires knowledge, not abstraction.

    Agreed, but do any of us have the vocabulary to adequately describe it ??

    Also, things do NOT need to be all seq_cst. Shit. My SPARC/RMO
    experience is directly relevant here. All the atomic RMWs are relaxed,
    and you use MEMBAR to express exactly the ordering you need and nothing more. Avoid #StoreLoad at all costs — only pay for it when the algorithm genuinely demands it. That discipline forces you to think carefully
    about what guarantees you actually need between specific threads, which
    is a feature not a bug.

    What percent of programmers are capable of absorbing this kind of
    knowledge and making effective use of it ??

    And here's where the two points connect: if you understand the topology,
    you know which threads share which data, which tells you what ordering guarantees you actually need between them.

    Say your system is as big as the entire galaxy where it takes 100,000 years
    for a signal from one side to reach the other side (and another 100,000
    years coming back) yet because of its size you get 10**(10**100) times the compute power you have on earthy scale machines; can you think of an
    adequate abstraction that would allow SW to make use of this scale of
    a system that is not already embarrassingly parallel ??

    Abstract away the topology
    and you're pushed toward conservative, expensive, blanket seq_cst
    everywhere because you can't reason about anything more precise. Expose
    the topology and you can be surgical about it.

    Relaxed where you can. Stronger ordering only where you must. And always knowing the shape of the system you're running on. ;^)

    Consider the galaxy scale system mentioned above. What kind of topology
    will be adequate for that kind of scale ??

    Also, for x86 yes we need to pad and align for the l2 cache line anyway.
    So shit happens?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:45:13 2026
    From Newsgroup: comp.arch

    On 6/1/2026 6:18 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:
    [...]
    Consider the galaxy scale system mentioned above. What kind of topology
    will be adequate for that kind of scale ??

    Sorry for the quick response, I am really busy right now. Will get back
    to you with more detail sometime tonight.

    Galaxy level? The super massive black hole. Its smaller black holes
    wandering the dust lanes? The stars, the solar systems, the planets in
    orbit. Down to the per-thread level...

    Say, for starters:

    per-thread
    per-cpu
    per-numa

    finally go global only if we have to!

    Also, try to avoid sharing when possible.


    As for programmers that know this shit, well, they are few and far
    between. But, that does not mean, to borrow a phrase from Joe Seigh...


    When we get hardware we make software.
    When we get lemons we make lemonade.

    I love RMO! The Alpha is too strict, but shit happens. Alpha is too
    strict in the sense that it does not honor a load dependency. RCU guys
    know all about it... shit happens...






    Also, for x86 yes we need to pad and align for the l2 cache line anyway.
    So shit happens?

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:47:27 2026
    From Newsgroup: comp.arch

    On 6/2/2026 1:45 PM, Chris M. Thomasson wrote:
    On 6/1/2026 6:18 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:
    [...]
    Consider the galaxy scale system mentioned above. What kind of topology
    will be adequate for that kind of scale ??

    Sorry for the quick response, I am really busy right now. Will get back
    to you with more detail sometime tonight.

    Galaxy level? The super massive black hole. Its smaller black holes wandering the dust lanes? The stars, the solar systems, the planets in orbit. Down to the per-thread level...

    Say, for starters:

    per-thread
    per-cpu
    per-numa

    finally go global only if we have to!

    Also, try to avoid sharing when possible.


    As for programmers that know this shit, well, they are few and far
    between. But, that does not mean, to borrow a phrase from Joe Seigh...


    When we get hardware we make software.
    When we get lemons we make lemonade.

    I love RMO! The Alpha is too strict, but shit happens. Alpha is too
    strict in the sense that it does not honor a load dependency. RCU guys
    know all about it... shit happens...





    Also, for x86 yes we need to pad and align for the l2 cache line anyway. >>> So shit happens?


    ;^)

    per-thread
    per-hyperthread/SMT slot
    per-core
    per-CPU (socket)
    per-NUMA node
    per-interconnect domain
    per-rack
    per-datacenter

    ...per-galaxy
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Sun Jun 7 15:23:15 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on::

    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties
    see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.

    define: memory-atomic as a unit of work identified in a core and packaged
    up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.

    I understand why some folks like memory-atomic--it solves the
    transaction storm issue when a location is heavily contended. The
    tradeoff is making the atomic action very slow all the time even when
    there's no contention. To simplify discussion, I'm just going to call
    the contended atomic location a "lock", even if it isn't really one.

    The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
    atomic and it's not in the cache, then it sends the atomic operation
    into the fabric to let some other central thing do the atomic operation.
    And your core-atomic is just a way to say the atomic is done by a CPU
    once it gets the cacheline Exclusive.

    I feel the best rule for atomic design (for hardware and software) is to
    always optimize the non-contended case, and try to make the
    non-contended case as likely as possible. The core-atomic is therefore
    what should be planned for and optimized for.

    The problem with memory-atomic/Far-atomic is if the lock is on the same cacheline as the data its protecting (which is pretty common), then
    there is a lot of extra memory traffic pushing this around. To do the Far-atomic needs to snoop that data out, do the atomic, return the
    atomic result, and then the initiating CPU still needs to fetch the
    line. This is all wasted time, especially when there's no contention! The usual solution to contention is lots of locks, so the chance that any of
    them will be in the cache is low.

    There are ways to reduce the transaction storm around a contended lock,
    but they require hardware to get more involved in the entire
    "transaction". To take an example, assume a mutex protects some piece
    of data which needs some software manipulation (think the head of a
    linked list, and we're going to insert at the head). If written out as
    an explicit transaction, where software has to interact with the platform-specific way to do an atomic transaction (and with all the restrictions), this is high overhead for software, and does NOT match how
    most software is written.

    But hardware could dynamically detect transactions when done as
    core-atomic. Assume an atomic operation is the start of a
    "transaction", for instance where we get a lock, then do something
    while owning the lock, and then release the lock. Once we bring in that cacheline Exclusive and do the atomic operation, a Tentative Transaction starts. You block responding to snoops until we decide we have to. We
    can do misses and whatnot, but after say 500 clocks, we allow snoops
    again (to avoid deadlocks). The transaction ends when the atomic location
    is written again.

    What this achieves is an end to thrashing. When it's working, the transaction-prediction hardware lets the thread get the lock, and do its
    work, and release the lock, in one go, not responding to outside snoops.
    When it does respond to the snoops, the first snooper immediately wins the
    lock and gets its work done, and so on. For this to work easily, the
    software should NOT peek at the lock location with a load, to "see" if
    it's free. It should just grab it.

    Add a Transaction-prediction structure, to mark which atomics seem to be
    the start of a transaction, and which ones are not, and roughly what that transaction looks like. And then you've solved the cache thrashing of locks--after seeing an atomic a few times, you learn what it's doing, and
    can try to make it succeed each time it executes. And how long each transaction seems to be, the raise the limit for blocking snoops to
    allow it to be more likely to succeed.

    To handle the load peeking at a lock to see if its free requires
    the Transaction-prediction hardware to detect this load and effectively
    turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
    and it becomes the start of the transaction.

    This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing. If the software algorithm is poor,
    where there's a big lock protecting a lot of work, then it exceeds the hardware's ability to do efficiently, and the hardware instead works like
    other platforms, and there can be thrashing.

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Jun 7 18:34:11 2026
    From Newsgroup: comp.arch


    [email protected] (Kent Dickey) posted:

    Thank you Kent, much valuable information below.

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on::

    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties >see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.

    define: memory-atomic as a unit of work identified in a core and packaged >up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.

    I understand why some folks like memory-atomic--it solves the
    transaction storm issue when a location is heavily contended. The
    tradeoff is making the atomic action very slow all the time even when
    there's no contention. To simplify discussion, I'm just going to call
    the contended atomic location a "lock", even if it isn't really one.

    LL/SCs are designed to take no more than LD/ST pipeline time outside of contention.

    The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
    atomic and it's not in the cache, then it sends the atomic operation
    into the fabric to let some other central thing do the atomic operation.
    And your core-atomic is just a way to say the atomic is done by a CPU
    once it gets the cacheline Exclusive.

    What do you do when you have more than 1 cache line--and all lines need
    to be in an exclusive state prior to modifying concurrent data ??

    I feel the best rule for atomic design (for hardware and software) is to always optimize the non-contended case, and try to make the
    non-contended case as likely as possible. The core-atomic is therefore
    what should be planned for and optimized for.

    OK.

    The problem with memory-atomic/Far-atomic is if the lock is on the same cacheline as the data its protecting (which is pretty common), then
    there is a lot of extra memory traffic pushing this around.

    The whole reason for {memory,far}-atomic is to avoid moving data around
    the interconnect inorder to perform an event/transaction.

    To do the Far-atomic needs to snoop that data out, do the atomic, return the
    atomic result, and then the initiating CPU still needs to fetch the
    line. This is all wasted time, especially when there's no contention! The usual solution to contention is lots of locks, so the chance that any of
    them will be in the cache is low.

    Fair point-but Chris M. Thomasson makes the point that the locks should not
    be placed together--your position is that this happens "more or less" all
    the time. Hmmmmm

    There are ways to reduce the transaction storm around a contended lock,
    but they require hardware to get more involved in the entire
    "transaction". To take an example, assume a mutex protects some piece
    of data which needs some software manipulation (think the head of a
    linked list, and we're going to insert at the head). If written out as
    an explicit transaction, where software has to interact with the platform-specific way to do an atomic transaction (and with all the restrictions), this is high overhead for software, and does NOT match how most software is written.

    But hardware could dynamically detect transactions when done as
    core-atomic. Assume an atomic operation is the start of a
    "transaction", for instance where we get a lock, then do something
    while owning the lock, and then release the lock. Once we bring in that cacheline Exclusive and do the atomic operation, a Tentative Transaction starts. You block responding to snoops until we decide we have to. We

    In CAS atomics, the LDs which start the lock acquire are not special
    in any way; making it hard for core to detect. LL, on the other hand,
    tells core that something special is beginning. This reduces contention
    because the LD (of CAS) is performed without "for modification" and
    thus the lines can be shared. But at CAS, the lines have to become
    Exclusive.

    can do misses and whatnot, but after say 500 clocks, we allow snoops
    again (to avoid deadlocks). The transaction ends when the atomic location
    is written again.

    In a system with multiple-chips each containing dozens of cores, 500
    cycles might not be enough to even send a CI around the interconnect.

    What this achieves is an end to thrashing. When it's working, the transaction-prediction hardware lets the thread get the lock, and do its work, and release the lock, in one go, not responding to outside snoops.

    My 66000 ships <core> priority around with memory accesses so that the
    remote cache can determine if it fails or if requestor fails. Helping
    higher priority atomic-events at the cost to lower priority events.

    But also; if core reaches the beginning of the stores (to participating
    cache lines) while all this data traffic is sorting itself out, then,
    core gains the ability to NaK SNOOP requests (priority based) to
    participating lines until the event finishes with the SC to the final
    variable in memory.

    My 66000 ISA contains the necessary instruction semantics to perform core-atomic events over multiple participating cache lines (up to 8).

    When it does respond to the snoops, the first snooper immediately wins the lock and gets its work done, and so on. For this to work easily, the software should NOT peek at the lock location with a load, to "see" if
    it's free. It should just grab it.

    This seems to me to be wrongly optimistic!! Another core attempting a transaction on a concurrent data structure, and the attempt fails
    needs to understand that the CDS is no longer in the state this core
    read out at the beginning, and instead of trying to complete, it should
    take a fresh look at CDS. For example several cores try to deQueue
    work from the front of a Queue. One core "wins" and deQueues the front
    unit of work. The others need to know the Queue element they were going
    after is not there are fail to a point where they take a fresh look at
    the Queue.

    Add a Transaction-prediction structure, to mark which atomics seem to be
    the start of a transaction, and which ones are not, and roughly what that transaction looks like. And then you've solved the cache thrashing of locks--after seeing an atomic a few times, you learn what it's doing, and
    can try to make it succeed each time it executes. And how long each transaction seems to be, the raise the limit for blocking snoops to
    allow it to be more likely to succeed.

    How are you going to weave in that some cores have higher priority than
    others? You don't want a low priority queue NaKing snoops which would
    allow the higher priority core to make forward progress at the expense
    of the lower priority cores. As an example, the low priority event is registering that the left-rear tire is low on air pressure, while the
    high priority event is registering panic-braking to avoid a collision.

    To handle the load peeking at a lock to see if its free requires
    the Transaction-prediction hardware to detect this load and effectively
    turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
    and it becomes the start of the transaction.

    My 66000 can respond to Read-with-Intent with here is data, but it is Shared not Exclusive (based on priority at location of cache line).

    This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing. If the software algorithm is poor,
    where there's a big lock protecting a lot of work, then it exceeds the hardware's ability to do efficiently, and the hardware instead works like other platforms, and there can be thrashing.

    I agree with most of the functionality of the transaction-predictor,
    but it is already built into My 66000 ISA--and is not a predictor,
    but mechanics.

    Kent

    THanks again.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sun Jun 7 12:14:24 2026
    From Newsgroup: comp.arch

    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    In article <[email protected]>

    Sorry for the big snip...

    [...]


    Fwiw, from my personal experience, two locks need to be padded and
    aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
    the reservation granule. Get the granule size wrong on LL/SC and your SC
    can cause some interesting harm (a lot of sc failures, etc.) to another
    core's reservation even when the locks have no logical relationship to
    each other...

    So if your algo is using LL/SC you need to worry about the reservation granule.

    If its using something akin to LOCK RMW, well, the l2 cache line is your friend.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Mon Jun 8 20:21:52 2026
    From Newsgroup: comp.arch

    In article <1104fui$2ns6j$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    In article <[email protected]>

    Sorry for the big snip...

    [...]


    Fwiw, from my personal experience, two locks need to be padded and
    aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
    the reservation granule. Get the granule size wrong on LL/SC and your SC
    can cause some interesting harm (a lot of sc failures, etc.) to another >core's reservation even when the locks have no logical relationship to
    each other...

    So if your algo is using LL/SC you need to worry about the reservation >granule.

    If its using something akin to LOCK RMW, well, the l2 cache line is your >friend.

    Yes, different mutexes/locks/etc. need to be spread apart from each other,
    to avoid accidental contention. However, if you have a huge number of mutexes/locks/etc. then it's less important, since by design you're
    spreading accesses out over "enough", so you're less likely to get any contention. If you have 300 CPUs and 10,000 database rows with a mutex
    per row, then contention is not likely a big deal.

    My quick look at C++:

    #include <mutex>
    std::mutex sample_mutex;

    show sample_mutex is 48 bytes. This includes OS/thread overhead to put the thread to sleep if the fast-grab algorithm doesn't work after a few tries, along with some internal alignment padding.

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Mon Jun 8 21:36:34 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:

    Thank you Kent, much valuable information below.
    [snip]
    I understand why some folks like memory-atomic--it solves the
    transaction storm issue when a location is heavily contended. The
    tradeoff is making the atomic action very slow all the time even when
    there's no contention. To simplify discussion, I'm just going to call
    the contended atomic location a "lock", even if it isn't really one.

    LL/SCs are designed to take no more than LD/ST pipeline time outside of >contention.

    The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
    atomic and it's not in the cache, then it sends the atomic operation
    into the fabric to let some other central thing do the atomic operation.
    And your core-atomic is just a way to say the atomic is done by a CPU
    once it gets the cacheline Exclusive.

    What do you do when you have more than 1 cache line--and all lines need
    to be in an exclusive state prior to modifying concurrent data ??

    Terminology is tricky here. I was using "atomic" to be the individual
    atomic instruction, like Compare-and-Swap, or the operation(s) to achieve
    a mutex (which might be an atomic instruction and 2-3 others).

    Atomics can be used for lock-free algorithms, such as producer/consumer adjusting FIFO ptrs, where there's no spinning, and the atomic operation
    is not followed by a "release" of some sort.

    So I wanted to talk just about mutexes, where we need an atomic instruction
    (or 2-3) to enter a critical area, then we do some amount of work which
    is just normal code, then we "release" the mutex.

    And what I was trying to get across is it's best if these are done local to
    the CPU (and not as Far Atomics), and that hardware can do some fairly
    simple things to solve the contention storm issue, using some history
    logic to remember how each atomic behaves.

    It's my opinion there is no good use case for Far-Atomics/Memory-Atomics,
    but that's a bigger issue I don't have time to explain in detail. Well,
    if you have non-CPUs who need coordination with CPUs, Far-Atomics/Memory- Atomics might make sense, but that's a special use case.

    I feel the best rule for atomic design (for hardware and software) is to
    always optimize the non-contended case, and try to make the
    non-contended case as likely as possible. The core-atomic is therefore
    what should be planned for and optimized for.

    OK.

    The problem with memory-atomic/Far-atomic is if the lock is on the same
    cacheline as the data its protecting (which is pretty common), then
    there is a lot of extra memory traffic pushing this around.

    The whole reason for {memory,far}-atomic is to avoid moving data around
    the interconnect inorder to perform an event/transaction.

    Let's say I have a mutex, protecting a memory-resident logging facility. A thread needs to win the mutex, and then use the logging structure to
    write a message, where there's a buffer, a buffer_length, and a
    current_offset. In the rare case where the buffer fills, the thread needs
    to flush the buffer out, and reset the offset. When done, release the mutex.

    #include <mutex>
    struct {
    std::mutex mutex;
    char *buffer_ptr;
    int buffer_size;
    int buffer_offset;
    } Logger;

    The mutex contains an internal "lock" variable, where 0 means free, and non-0 means some thread has the mutex. So grabbing "lock" means do a Compare-and-Swap, checking that it is 0, and writing in our thread info if
    we win.

    So the logging function is:

    Logger::log(char *str, int len)
    {
    int new_len;

    mutex.lock(); // Acquire the lock
    new_len = MIN(buffer_size - buffer_offset, len);
    memcpy(&buffer_ptr[buffer_offset], str, new_len);
    buffer_offset += new_len;
    if(new_len >= buffer_size) {
    str += new_len;
    len -= new_len;
    // Flush buffer, etc. etc.
    (void)write(logfd, buffer_ptr, buffer_size);
    buffer_offset = 0;
    if(len) {
    memcpy(&buffer_ptr[0], str, len);
    }
    }
    mutex.unlock();
    }

    Grabbing the mutex is the "atomic" operation. Doing the entire log()
    function is the Transaction, from mutex.lock() through mutex.unlock(). mutex.unlock() is basically just a store to the lock address to write it
    to 0 (but it might involve waking up threads, etc.).

    So what I'm proposing is hardware detects that there is a "Transaction"
    from mutex.lock() through mutex.unlock(). And it tries to hold off
    snoops to the mutex lock address until it completes all the work inside.
    What this achieves is the snoop response is then the lock address with
    the lock indicating "free", so the next CPU immediately wins the lock
    and can do it's log() work. But if it takes "too long" in log(), then
    the CPU lets snoops through (which is needed to avoid system deadlocks,
    in general), and then it can expose the mutex.lock state while log() is
    still running.

    And this solves the contention storm. The contention storm happens when
    other threads see the mutex.lock variable as locked. In fact, any read
    to the mutex.lock variable which shows it locked is wasted effort. So,
    if the CPU which wins the mutex can hold off snoops until it unlocks the
    mutex, then the contention storm goes away. The mutex just nicely
    passes to other CPUs one at a time.

    The idea is not to JUST do the atomic operation, but to try to do the
    entire log() function at once--but it's NOT required to always succeed
    in doing it all at once. If hw can just do it 99% of the time, that
    would be a HUGE win. The atomic mutex.lock() and mutex.unlock() must
    always be atomic, but if hardware can avoid letting other CPUs see the
    hidden lock variable in the locked state most of the time, it fixes the contention storm.

    We cannot always succeed in holding off snoops. The CPU which wins the
    mutex and then tries to access *buffer_ptr might deadlock the system if
    we don't allow snoops to us to proceed under certain traffic conditions,
    even in the simplest of cases. In this case, clearly the write() system
    call requires snoops to be allowed in.

    So we need a timer, and then in the generally rare case when we cannot
    just get it all done at once, let snoops happen, and avoid deadlock.
    This does cause the start of a contention storm (all CPUs read the lock,
    see it's busy, and then try to grab it), but as soon as the next CPU
    wins the mutex, the system can go back to the orderly passing to the next
    CPU.

    To do the
    Far-atomic needs to snoop that data out, do the atomic, return the
    atomic result, and then the initiating CPU still needs to fetch the
    line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
    them will be in the cache is low.

    Fair point-but Chris M. Thomasson makes the point that the locks should not >be placed together--your position is that this happens "more or less" all
    the time. Hmmmmm

    It's important that different Mutex'es be on different cachelines. But
    look at my code example: I have a structure with a mutex and with other
    data. If I win the mutex, then the FIRST thing I need is to access that
    other data. So it would be great if winning the mutex brought that data
    into cache as well.

    This is why I think Far-Atomics/Memory-Atomics are not good ideas. If mutex.lock() is done at the memory controller, then the CPU has to bring
    in that cache line ANYWAY right afterwords.

    There are ways to reduce the transaction storm around a contended lock,
    but they require hardware to get more involved in the entire
    "transaction". To take an example, assume a mutex protects some piece
    of data which needs some software manipulation (think the head of a
    linked list, and we're going to insert at the head). If written out as
    an explicit transaction, where software has to interact with the
    platform-specific way to do an atomic transaction (and with all the
    restrictions), this is high overhead for software, and does NOT match how
    most software is written.

    But hardware could dynamically detect transactions when done as
    core-atomic. Assume an atomic operation is the start of a
    "transaction", for instance where we get a lock, then do something
    while owning the lock, and then release the lock. Once we bring in that
    cacheline Exclusive and do the atomic operation, a Tentative Transaction
    starts. You block responding to snoops until we decide we have to. We

    In CAS atomics, the LDs which start the lock acquire are not special
    in any way; making it hard for core to detect. LL, on the other hand,
    tells core that something special is beginning. This reduces contention >because the LD (of CAS) is performed without "for modification" and
    thus the lines can be shared. But at CAS, the lines have to become >Exclusive.

    It's easy to detect a mutex. Have a Transaction Prediction structure
    indexed by a hash function which includes the address and the
    instruction address, and store the address in the entry along with a bit
    for CAS, and a bit for Mutex. Mark the entry as having a CAS done to
    it, and then when hw sees a store to that some address "soon after",
    upgrade the entry to a mutex. Now, when we do a CAS and look it up,
    we'll see this is a mutex, and to try to hold off snoops until we see
    the store to that address.

    If we only ever see CAS to an address, and always from the same instruction address, then it's not a mutex. It would be useful to detect other
    Transaction types, but this post is already too long to go into other
    cases.

    Mutexes are easy for software to use, easy for humans to understand and
    reason about, and relatively easy for hardware to make very efficient.

    can do misses and whatnot, but after say 500 clocks, we allow snoops
    again (to avoid deadlocks). The transaction ends when the atomic location >> is written again.

    In a system with multiple-chips each containing dozens of cores, 500
    cycles might not be enough to even send a CI around the interconnect.

    I would expand the Transaction Prediction structure to have a few bits indicating the time to hold off snoops. Start low, and if we keep seeing
    the store not happen in time, and yet it still looks like a mutex, keep increasing the delay. This prevents all atomics from causing long snoop
    delay periods, while tuning in enough delay to handle the operations of
    real use cases.

    What this achieves is an end to thrashing. When it's working, the
    transaction-prediction hardware lets the thread get the lock, and do its
    work, and release the lock, in one go, not responding to outside snoops.

    My 66000 ships <core> priority around with memory accesses so that the
    remote cache can determine if it fails or if requestor fails. Helping
    higher priority atomic-events at the cost to lower priority events.

    But also; if core reaches the beginning of the stores (to participating
    cache lines) while all this data traffic is sorting itself out, then,
    core gains the ability to NaK SNOOP requests (priority based) to >participating lines until the event finishes with the SC to the final >variable in memory.

    My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).

    I'm proposing an idea that can be implemented on any CPU, and requires no special support from the interconnect.

    I think your NAK idea sounds risky. I've dealt a lot in the past with
    buses using retries, and the short answer is it's VERY HARD to implement
    a protocol with retries and not have degenerate starvation cases. Everyone thinks they've got it solved, that it's easy, and it's just not easy. There's a reason retries are removed from almost all modern protocols.

    The problem is if you have 3 CPUs contending, and the way NAKs work one
    CPU might be the unlucky victim forever, with 2 CPUs starving it forever.
    In general, it's best to avoid retry-like behavior. For simple cases, you could build in some starvation detection, but in general that just pushes
    your starvation/livelock case to a more complex case, not fix it.

    When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
    software should NOT peek at the lock location with a load, to "see" if
    it's free. It should just grab it.

    This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
    needs to understand that the CDS is no longer in the state this core
    read out at the beginning, and instead of trying to complete, it should
    take a fresh look at CDS. For example several cores try to deQueue
    work from the front of a Queue. One core "wins" and deQueues the front
    unit of work. The others need to know the Queue element they were going
    after is not there are fail to a point where they take a fresh look at
    the Queue.

    I tried to expand my thinking with the log() function example so you can see that each CPU will see the updated state left by the previous one.

    For your CPU, you propose a Transaction sequence with a start and end instruction, and I'm just pointing out you can get basically all of the
    benefit with some prediction hardware picking out mutexes and making them
    just as efficient. But with NONE of the deadlock or forward progress
    issues, since if things go "wrong", you just default to running the code
    as written.

    Add a Transaction-prediction structure, to mark which atomics seem to be
    the start of a transaction, and which ones are not, and roughly what that
    transaction looks like. And then you've solved the cache thrashing of
    locks--after seeing an atomic a few times, you learn what it's doing, and
    can try to make it succeed each time it executes. And how long each
    transaction seems to be, the raise the limit for blocking snoops to
    allow it to be more likely to succeed.

    How are you going to weave in that some cores have higher priority than >others? You don't want a low priority queue NaKing snoops which would
    allow the higher priority core to make forward progress at the expense
    of the lower priority cores. As an example, the low priority event is >registering that the left-rear tire is low on air pressure, while the
    high priority event is registering panic-braking to avoid a collision.

    Priority doesn't come into play at this level. I'm just trying to help
    make the log() function appear to execute as a single unit most of the time. And if "most" becomes "never" for some pathological cases, then in the
    worst case it works just like existing CPUs.

    To handle the load peeking at a lock to see if its free requires
    the Transaction-prediction hardware to detect this load and effectively
    turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
    and it becomes the start of the transaction.

    My 66000 can respond to Read-with-Intent with here is data, but it is Shared >not Exclusive (based on priority at location of cache line).

    This seems like a not very useful thing to do, why would you do that?
    If a CPU says it's going to write to a cacheline, and you give it data but Shared ownership state, it just has to immediately ask for write
    permission again.

    But this doesn't affect my proposal at all. The CPU has to get write permission to do the Compare-and-Swap for the mutex.lock() operation, and
    once it does, then the Transaction functionality kicks in.

    This avoids the need for any software changes, while directly speeding up
    atomics and what atomics are doing. If the software algorithm is poor,
    where there's a big lock protecting a lot of work, then it exceeds the
    hardware's ability to do efficiently, and the hardware instead works like
    other platforms, and there can be thrashing.

    I agree with most of the functionality of the transaction-predictor,
    but it is already built into My 66000 ISA--and is not a predictor,
    but mechanics.

    Kent

    THanks again.

    And I'm proposing no new architecturally visible changes to make some
    lock operations more efficient. So that any CPU could do it.

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Jun 9 00:45:14 2026
    From Newsgroup: comp.arch


    [email protected] (Kent Dickey) posted:

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:
    ------------------------
    My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).

    I'm proposing an idea that can be implemented on any CPU, and requires no special support from the interconnect.

    I think your NAK idea sounds risky. I've dealt a lot in the past with
    buses using retries, and the short answer is it's VERY HARD to implement
    a protocol with retries and not have degenerate starvation cases.

    Yes, it is tricky, no it is not impossible. The core must be in a state
    that is rather special in order to NaK snoop responses. In fact both
    cores need to paly their parts of the tricky game. The snooped core
    must only NaK when it is able to get all the way through the STs and
    complete the transaction/event; and the snooping core must understand
    if it is attempting an AOMTIC event (or not). If it is, the NaK signals
    failure of the event, if it is not, the NaK has the core retry its NAKed request.

    A NaK retry is on the order of 400-cycles, so the interfering core takes
    the added latency, while the core processing the event completes apace.

    Everyone thinks they've got it solved, that it's easy, and it's just not easy. There's
    a reason retries are removed from almost all modern protocols.

    Do you understand how hard it is to perform Double-Compare-Double-Swap
    without something special in the protocol under withering contention ??

    The problem is if you have 3 CPUs contending, and the way NAKs work one
    CPU might be the unlucky victim forever, with 2 CPUs starving it forever.

    For the case where all cores are at equal priority, one wants a semblance
    of fairness--witness CRAY-XMP as one where cores 7 and 8 could be locked
    out of memory references when cores [1..6] were beating on memory at full speed. {Not even ATOMIC}

    For the case where cores have priority, one wants fairness based on priority and when 1 core is at lower priority, it takes the hits to allow higher priority cores to make <more> forward progress.

    One would not want a low priority Guest OS worker thread to hold up some
    high priority kernel thread just because it got to the lock first. This
    is commonly known as priority inversion.

    In general, it's best to avoid retry-like behavior. For simple cases, you could build in some starvation detection, but in general that just pushes your starvation/livelock case to a more complex case, not fix it.

    When a core detects failure it increments a counter, when the counter gets
    to a certain point, the core uses a different strategy to make forward progress. Under this <rarely used> strategy, it bundles up the addresses
    it needs to complete the event and ships them to a system arbiter. If no address is currently reserved, the system arbiter grants exclusive access
    to all <up to> 8 lines simultaneously. At this point core can Nak any
    snoop on those addresses. When the event is complete, core sends the same addresses to system arbiter; who removes their reservations.

    This can transform a BigO( n^3 ) problem under heavy contention into a
    BigO( 3 ) problem {when SW knows how to use the data HW provides}.

    When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
    software should NOT peek at the lock location with a load, to "see" if
    it's free. It should just grab it.

    This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
    needs to understand that the CDS is no longer in the state this core
    read out at the beginning, and instead of trying to complete, it should >take a fresh look at CDS. For example several cores try to deQueue
    work from the front of a Queue. One core "wins" and deQueues the front
    unit of work. The others need to know the Queue element they were going >after is not there are fail to a point where they take a fresh look at
    the Queue.

    I tried to expand my thinking with the log() function example so you can see that each CPU will see the updated state left by the previous one.

    For your CPU, you propose a Transaction sequence with a start and end instruction, and I'm just pointing out you can get basically all of the benefit with some prediction hardware picking out mutexes and making them just as efficient.

    Yes, I saw that.

    But with NONE of the deadlock or forward progress
    issues, since if things go "wrong", you just default to running the code
    as written.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:20:53 2026
    From Newsgroup: comp.arch

    On 6/7/2026 11:34 AM, MitchAlsup wrote:

    [email protected] (Kent Dickey) posted:

    Thank you Kent, much valuable information below.

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on::

    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >>> fall under this definition.

    DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK CMPXCHG8B on a 32 bit system.


    define: memory-atomic as a unit of work identified in a core and packaged >>> up to be performed by the cache hierarchy or PCIe device control registers. >>> Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP,
    CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >>> of-work are performed in the cache having write-permission {EXCLUSIVE or >>> MODIFIED cache line states}. When unCacheable, these units of work are
    performed directly on the addressed control register. In both cases, any >>> interested party sees only the previous state or the after state.

    I understand why some folks like memory-atomic--it solves the
    transaction storm issue when a location is heavily contended. The
    tradeoff is making the atomic action very slow all the time even when
    there's no contention. To simplify discussion, I'm just going to call
    the contended atomic location a "lock", even if it isn't really one.

    LL/SCs are designed to take no more than LD/ST pipeline time outside of contention.

    The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
    atomic and it's not in the cache, then it sends the atomic operation
    into the fabric to let some other central thing do the atomic operation.
    And your core-atomic is just a way to say the atomic is done by a CPU
    once it gets the cacheline Exclusive.

    What do you do when you have more than 1 cache line--and all lines need
    to be in an exclusive state prior to modifying concurrent data ??

    I feel the best rule for atomic design (for hardware and software) is to
    always optimize the non-contended case, and try to make the
    non-contended case as likely as possible. The core-atomic is therefore
    what should be planned for and optimized for.

    OK.

    The problem with memory-atomic/Far-atomic is if the lock is on the same
    cacheline as the data its protecting (which is pretty common), then
    there is a lot of extra memory traffic pushing this around.

    The whole reason for {memory,far}-atomic is to avoid moving data around
    the interconnect inorder to perform an event/transaction.

    To do the
    Far-atomic needs to snoop that data out, do the atomic, return the
    atomic result, and then the initiating CPU still needs to fetch the
    line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
    them will be in the cache is low.

    Fair point-but Chris M. Thomasson makes the point that the locks should not be placed together--your position is that this happens "more or less" all
    the time. Hmmmmm

    Unfortunately its true. The mutex word, or state if you will, should be
    padded and aligned up to a l2 cache line for atomic RMW or a reservation granule for LL/SC.


    There are ways to reduce the transaction storm around a contended lock,
    but they require hardware to get more involved in the entire
    "transaction". To take an example, assume a mutex protects some piece
    of data which needs some software manipulation (think the head of a
    linked list, and we're going to insert at the head). If written out as
    an explicit transaction, where software has to interact with the
    platform-specific way to do an atomic transaction (and with all the
    restrictions), this is high overhead for software, and does NOT match how
    most software is written.

    But hardware could dynamically detect transactions when done as
    core-atomic. Assume an atomic operation is the start of a
    "transaction", for instance where we get a lock, then do something
    while owning the lock, and then release the lock. Once we bring in that
    cacheline Exclusive and do the atomic operation, a Tentative Transaction
    starts. You block responding to snoops until we decide we have to. We

    In CAS atomics, the LDs which start the lock acquire are not special
    in any way; making it hard for core to detect. LL, on the other hand,
    tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
    thus the lines can be shared. But at CAS, the lines have to become
    Exclusive.

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong
    does not allow that to occur. So, it can be used for interesting state machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly.
    Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:30:58 2026
    From Newsgroup: comp.arch

    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    [...]
    This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing.

    What about weak vs strong CAS? Also how does emulating a LOCK XADD speed things up? Now SC can fail, and we have to loop. How many times do we
    have to loop to know what that emulated LOCK XADD finishes? Its akin to emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
    former is quite grand.

    Breaking things apart works just as well for LL/SC as atomic RMW.

    How does the LL/SC work wrt the arch? How fine grain is it? Is its
    reservation granule the same size as say a L2 cache line for atomic RMW?

    On systems with both LL/SC and atomic RMW, is it one size fits all?

    LL/SC can suffer from live lock and false sharing.

    LOCK XADD can suffer from false sharing, its not going to live lock on ya.

    [...]
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:35:01 2026
    From Newsgroup: comp.arch

    On 6/8/2026 5:45 PM, MitchAlsup wrote:
    [...]
    Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??

    It sounds like you are trying to fulfill a lot of the academic papers?

    However, DWCAS is a really important one. This is only a double width
    compare and swap that operates on two contiguous words. Works great for
    a great deal of real lock/wait-free algos. Now if the words are non-contiguous, then things get into LL/SC like worlds...

    [...]
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:37:54 2026
    From Newsgroup: comp.arch

    On 6/8/2026 5:45 PM, MitchAlsup wrote:
    [...]

    Have you ever read about KCSS? K compare single swap? An older paper.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:25:02 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:

    [email protected] (Kent Dickey) posted:

    Thank you Kent, much valuable information below.

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>
    Given a normal chip with multiple cores and a typical cache hierarchy
    with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
    fall under this definition.

    DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK CMPXCHG8B on a 32 bit system.

    I wrote DCADS to represent the two addresses are completely unrelated
    {like *p and *q are unrelated unless you know something special.
    You can update 2 pointers in a linked list.
    ------------------------
    In CAS atomics, the LDs which start the lock acquire are not special
    in any way; making it hard for core to detect. LL, on the other hand,
    tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
    thus the lines can be shared. But at CAS, the lines have to become Exclusive.

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded!

    This is what allows the ABA problem to raise its ugly head !! There
    could be a week (7 days) between the LD and the CAS--there is no
    expectation that the data (modified hundreds of times per second
    over that week returns to the exact state at when the LD was performed.)

    strong
    does not allow that to occur. So, it can be used for interesting state machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly.

    I have no argument with that. My argument is that one needs FEWER ATOMIC
    events when more memory locations can be manipulated in a single event.
    Fewer Events lower the exponent of interference and improves performance.

    Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:28:13 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/8/2026 5:45 PM, MitchAlsup wrote:
    [...]
    Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??

    It sounds like you are trying to fulfill a lot of the academic papers?

    However, DWCAS is a really important one. This is only a double width compare and swap that operates on two contiguous words. Works great for
    a great deal of real lock/wait-free algos. Now if the words are non-contiguous, then things get into LL/SC like worlds...

    DDCAS works better as both comparands are 64-bit and back-to-back.

    Don't see how you can much to work with two 32-bit comparands.

    [...]
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:33:30 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:
    -------------------------

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong
    does not allow that to occur. So, it can be used for interesting state machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly. Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.

    If you want the strong CAS it is written::

    LDD Rd,[address]
    ...
    ...
    LL R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    If you want weak CAS it is written:

    LL Rd,[address]
    ...
    ...
    LD R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    It just changes the length of time interference is being observed.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:32:47 2026
    From Newsgroup: comp.arch

    On 6/9/2026 6:28 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/8/2026 5:45 PM, MitchAlsup wrote:
    [...]
    Do you understand how hard it is to perform Double-Compare-Double-Swap
    without something special in the protocol under withering contention ??

    It sounds like you are trying to fulfill a lot of the academic papers?

    However, DWCAS is a really important one. This is only a double width
    compare and swap that operates on two contiguous words. Works great for
    a great deal of real lock/wait-free algos. Now if the words are
    non-contiguous, then things get into LL/SC like worlds...

    DDCAS works better as both comparands are 64-bit and back-to-back.

    Don't see how you can much to work with two 32-bit comparands.

    [...]

    Think of DWCAS on a 32-bit system. LOCK CMPXCHG8B on a 32 bit system. We
    can do many interesting things with it. LOCK CMPXCHG16B on a 64-bit
    system, ditto.

    DWCAS only operates on two contiguous words. Well, that alone allows us
    to create a lot of lock/wait free algos. The academic papers were
    interested in DCAS, KCSS, ect... They never had to make the cpu that can
    do that efficiently.

    FWIW, DWCAS can do other things besides ABA. Think of using two 32 bit
    regions of memory and a DWCAS can store "offsets" into both of them.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:50:18 2026
    From Newsgroup: comp.arch

    On 6/9/2026 6:25 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:

    [email protected] (Kent Dickey) posted:

    Thank you Kent, much valuable information below.

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>>>
    Given a normal chip with multiple cores and a typical cache hierarchy >>>>> with one-or-more PCIe root complexes::

    define: core-atomic as a sequence of instructions (possible 1) that
    perform a single unit of work atomically--that is all interested parties >>>>> see that the unit of work was either never started or finished completely >>>>> without any intermediate state being observed by any party other than the >>>>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
    fall under this definition.

    DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK
    CMPXCHG8B on a 32 bit system.

    I wrote DCADS to represent the two addresses are completely unrelated
    {like *p and *q are unrelated unless you know something special.
    You can update 2 pointers in a linked list.
    ------------------------
    In CAS atomics, the LDs which start the lock acquire are not special
    in any way; making it hard for core to detect. LL, on the other hand,
    tells core that something special is beginning. This reduces contention
    because the LD (of CAS) is performed without "for modification" and
    thus the lines can be shared. But at CAS, the lines have to become
    Exclusive.

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded!

    This is what allows the ABA problem to raise its ugly head !! There
    could be a week (7 days) between the LD and the CAS--there is no
    expectation that the data (modified hundreds of times per second
    over that week returns to the exact state at when the LD was performed.)

    ABA is an issue, wrt well you know the appendix of a IBM paper. Iirc,
    Annn and Lynn, sorry for butchering their names, were around when CAS
    was invented, iirc Lynn said it was their initials?

    However, I tend to work around that. For instance think of a distributed amortized lock-free stack that completely avoid DWCAS and ABA. So, for
    insert we can use CAS. BUT, for the pop, we flush all with an atomic SWAP.



    strong
    does not allow that to occur. So, it can be used for interesting state
    machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly.

    I have no argument with that. My argument is that one needs FEWER ATOMIC events when more memory locations can be manipulated in a single event.
    Fewer Events lower the exponent of interference and improves performance.

    Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an
    atomic RMW using L2 cache lines.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:53:07 2026
    From Newsgroup: comp.arch

    On 6/9/2026 6:33 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:
    -------------------------

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong
    does not allow that to occur. So, it can be used for interesting state
    machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly.
    Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an
    atomic RMW using L2 cache lines.

    If you want the strong CAS it is written::

    LDD Rd,[address]
    ...
    ...
    LL R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    If you want weak CAS it is written:

    LL Rd,[address]
    ...
    ...
    LD R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    It just changes the length of time interference is being observed.

    Pretty interesting! So, we can have spurious failure on the weak CAS and
    none on the strong CAS?

    strong CAS failure can be used in a state machine, weak CAS failure
    cannot...
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 13:16:00 2026
    From Newsgroup: comp.arch

    On 6/8/2026 1:21 PM, Kent Dickey wrote:
    In article <1104fui$2ns6j$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    In article <[email protected]>

    Sorry for the big snip...

    [...]


    Fwiw, from my personal experience, two locks need to be padded and
    aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
    the reservation granule. Get the granule size wrong on LL/SC and your SC
    can cause some interesting harm (a lot of sc failures, etc.) to another
    core's reservation even when the locks have no logical relationship to
    each other...

    So if your algo is using LL/SC you need to worry about the reservation
    granule.

    If its using something akin to LOCK RMW, well, the l2 cache line is your
    friend.

    Yes, different mutexes/locks/etc. need to be spread apart from each other,
    to avoid accidental contention. However, if you have a huge number of mutexes/locks/etc. then it's less important, since by design you're
    spreading accesses out over "enough", so you're less likely to get any contention. If you have 300 CPUs and 10,000 database rows with a mutex
    per row, then contention is not likely a big deal.

    My quick look at C++:

    #include <mutex>
    std::mutex sample_mutex;

    show sample_mutex is 48 bytes. This includes OS/thread overhead to put the thread to sleep if the fast-grab algorithm doesn't work after a few tries, along with some internal alignment padding.

    [...]

    Ahhh, not a C++ mutex but words in memory wrt lock-free algos using
    atomic RMW. They are different... Fwiw, check this out. A way to
    "emulate" a C++ std::atomic using a table of mutexes:

    read all when you get some time:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    Now, this is different because its NOT lock-free. Using say LOCK XADD is lock-free, BUT. We need to give it a break via amortization via distribution...

    We don't want a shit load of threads hammering a word that is used for
    LOCK XADD or even a LL/SC emulating a XADD.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 13:22:42 2026
    From Newsgroup: comp.arch

    On 6/10/2026 12:32 PM, Chris M. Thomasson wrote:
    On 6/9/2026 6:28 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/8/2026 5:45 PM, MitchAlsup wrote:
    [...]
    Do you understand how hard it is to perform Double-Compare-Double-Swap >>>> without something special in the protocol under withering contention ?? >>>
    It sounds like you are trying to fulfill a lot of the academic papers?

    However, DWCAS is a really important one. This is only a double width
    compare and swap that operates on two contiguous words. Works great for
    a great deal of real lock/wait-free algos. Now if the words are
    non-contiguous, then things get into LL/SC like worlds...

    DDCAS works better as both comparands are 64-bit and back-to-back.

    Don't see how you can much to work with two 32-bit comparands.
    [...]

    Think of DWCAS on a 32-bit system. LOCK CMPXCHG8B on a 32 bit system. We
    can do many interesting things with it. LOCK CMPXCHG16B on a 64-bit
    system, ditto.

    DWCAS only operates on two contiguous words. Well, that alone allows us
    to create a lot of lock/wait free algos. The academic papers were
    interested in DCAS, KCSS, ect... They never had to make the cpu that can
    do that efficiently.

    FWIW, DWCAS can do other things besides ABA. Think of using two 32 bit regions of memory and a DWCAS can store "offsets" into both of them.

    Wrt CAS in general, this is a fun thread:


    https://groups.google.com/g/comp.arch/c/1zNPUHo8YoI/m/jtq3VdG7BQAJ
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Thu Jun 11 02:58:23 2026
    From Newsgroup: comp.arch

    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    [...]
    This avoids the need for any software changes, while directly speeding up
    atomics and what atomics are doing.

    What about weak vs strong CAS? Also how does emulating a LOCK XADD speed >things up? Now SC can fail, and we have to loop. How many times do we
    have to loop to know what that emulated LOCK XADD finishes? Its akin to >emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
    former is quite grand.

    Breaking things apart works just as well for LL/SC as atomic RMW.

    How does the LL/SC work wrt the arch? How fine grain is it? Is its >reservation granule the same size as say a L2 cache line for atomic RMW?

    On systems with both LL/SC and atomic RMW, is it one size fits all?

    LL/SC can suffer from live lock and false sharing.

    LOCK XADD can suffer from false sharing, its not going to live lock on ya.

    [...]

    You trimmed everything, so there's really nothing I can respond to. You
    have your view as to how atomics should work, and it's an interesting perspective, but not the only one.

    If you had access to a system where std::mutex was faster and handled contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
    than the best implementation on any system), would you use mutexes rather
    than lock-free algorithms?

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 11 11:53:23 2026
    From Newsgroup: comp.arch

    On 6/10/2026 7:58 PM, Kent Dickey wrote:
    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/7/2026 8:23 AM, Kent Dickey wrote:
    [...]
    This avoids the need for any software changes, while directly speeding up >>> atomics and what atomics are doing.

    What about weak vs strong CAS? Also how does emulating a LOCK XADD speed
    things up? Now SC can fail, and we have to loop. How many times do we
    have to loop to know what that emulated LOCK XADD finishes? Its akin to
    emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
    former is quite grand.

    Breaking things apart works just as well for LL/SC as atomic RMW.

    How does the LL/SC work wrt the arch? How fine grain is it? Is its
    reservation granule the same size as say a L2 cache line for atomic RMW?

    On systems with both LL/SC and atomic RMW, is it one size fits all?

    LL/SC can suffer from live lock and false sharing.

    LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>
    [...]

    You trimmed everything, so there's really nothing I can respond to. You
    have your view as to how atomics should work, and it's an interesting perspective, but not the only one.

    If you had access to a system where std::mutex was faster and handled contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
    than the best implementation on any system), would you use mutexes rather than lock-free algorithms?

    Hummm...

    std::mutex is not lock-free by default.

    And more importantly, a LOCK XADD is fundamentally different from using
    a std::mutex to emulate one. They are not two implementations of the
    same thing.

    LOCK XADD is a single atomic instruction, one hardware transaction, no
    kernel involvement, no thread blocking, no scheduler interaction. Done.

    A std::mutex wrapping an increment is a completely different beast. Even
    in the uncontended fast path you're doing an atomic RMW on the lock
    word; a memory barrier; _critical section_; then another membar and
    atomic RMW on unlock. That's already more work than LOCK XADD before
    you've even touched the data. Under contention it gets much worse —
    threads block, context switches happen, the OS scheduler gets involved.

    So the hypothetical of 'what if std::mutex was faster' is interesting
    but it's asking the wrong question. The question isn't speed in
    isolation — it's what guarantees and costs come with the primitive. LOCK XADD gives you atomicity with no blocking, no kernel, no priority
    inversion, no deadlock risk. A mutex gives you none of those properties regardless of how fast it is.

    There are also contexts where you literally cannot use a mutex — signal handlers, certain real-time contexts, interrupt service routines.
    Lock-free is not just a performance choice, it's sometimes the only choice.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Thu Jun 11 21:30:18 2026
    From Newsgroup: comp.arch

    In article <110f075$1lt51$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/10/2026 7:58 PM, Kent Dickey wrote:
    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    [snip]
    On systems with both LL/SC and atomic RMW, is it one size fits all?

    LL/SC can suffer from live lock and false sharing.

    LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>>
    [...]

    You trimmed everything, so there's really nothing I can respond to. You
    have your view as to how atomics should work, and it's an interesting
    perspective, but not the only one.

    If you had access to a system where std::mutex was faster and handled
    contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
    than the best implementation on any system), would you use mutexes rather
    than lock-free algorithms?

    Hummm...

    std::mutex is not lock-free by default.

    And more importantly, a LOCK XADD is fundamentally different from using
    a std::mutex to emulate one. They are not two implementations of the
    same thing.

    Lock-free algorithms solve two problems of mutexes: they are immune to a
    thread being reschedules while holding a lock; and they are a response
    to poor performance of locks on many systems. Remote-memory atomics
    (doing the atomic in the fabric or at the memory controller) are a
    response to the poor contention performance of locks or atomics (on lock-free or locks) being done by the CPU on many platforms.

    But if we fix those two problems, and make locks fast, then there's much
    less need for lock-free algorithms.

    Lock-free also solves the thread preemption issue--a thread that wins a mutex and then gets interrupted blocks the entire application that relies on
    that mutex since that thread is no longer running, so if the platforms gave some guarantees about that not occurring, then any lock-free algorithm
    can pretty easily be changed to use a non-preemptible mutex instead.


    LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.

    A std::mutex wrapping an increment is a completely different beast. Even
    in the uncontended fast path you're doing an atomic RMW on the lock
    word; a memory barrier; _critical section_; then another membar and
    atomic RMW on unlock. That's already more work than LOCK XADD before
    you've even touched the data. Under contention it gets much worse — >threads block, context switches happen, the OS scheduler gets involved.

    So the hypothetical of 'what if std::mutex was faster' is interesting
    but it's asking the wrong question. The question isn't speed in
    isolation — it's what guarantees and costs come with the primitive. LOCK >XADD gives you atomicity with no blocking, no kernel, no priority
    inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.

    You may want to re-read my earlier post. I just proposed a way to make
    mutex faster than normal processors do even the most basic atomic
    sequences. I didn't describe this in that post, but if you add in a CPU
    which supports sequential ordering with the mutex Transaction Prediction
    logic, then that CPU can speculate and execute multiple locks at once
    easily (nested or serial), which is something weakly-ordered CPUs just
    cannot do since the barriers prevent starting new requests until
    ordering is guaranteed.

    There are also contexts where you literally cannot use a mutex — signal >handlers, certain real-time contexts, interrupt service routines.
    Lock-free is not just a performance choice, it's sometimes the only choice.

    If you have non-interruptable user mutexes, then it can work in these
    places. I did not explain how to do this, but is a pretty obvious
    add-on, but likely requires a new hint bit on the atomic instructions. A non-interruptable mutex ends up looking a bit like an explicit
    Transaction, but it has different properties and a very different HW implementation. Basically, the atomic operation needs a hint that this
    is a non-interruptable mutex, so the hardware knows how to release the
    lock if an exception occurs (such as a TLB miss). A compare-and-swap
    getting the mutex is different than a compare-and-swap doing a lock free algorithm, so we need a direct indication from the code that it's a
    mutex to get this right. Note that not getting the hint can still
    let HW make the mutex fast, it just isn't non-interruptable.

    What's nice about a mutex, which is basically a user-land spinlock, is
    it's about easiest for software (it's almost always the first described
    to new coders since it's easy to understand), AND the easiest for
    hardware to accelerate. The critical section is all in one place.
    Lock-free algorithms make it much harder for hardware to see what is the
    real action that needs to occur. Inserting a new head pointer to a
    linked list requires looking at the current head pointer, setting my new
    item's next-> to point to that, then doing Compare-and-Swap on the head
    pointer with our item pointer. It's hard to have hardware detect the
    prep work (but not impossible--in this particular case, we can see a
    load to the head pointer, then the write to the next-> field, and then the
    CAS to the head pointer, and can try to make that whole sequence atomic,
    but this is just a trivial lock-free case).

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 02:16:32 2026
    From Newsgroup: comp.arch


    [email protected] (Kent Dickey) posted:

    ERROR "unexpected byte sequence starting at index 2542: '\xE2'" while decoding:

    In article <110f075$1lt51$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/10/2026 7:58 PM, Kent Dickey wrote:
    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    [snip]
    On systems with both LL/SC and atomic RMW, is it one size fits all?

    LL/SC can suffer from live lock and false sharing.

    LOCK XADD can suffer from false sharing, its not going to live lock on ya.

    [...]

    You trimmed everything, so there's really nothing I can respond to. You >> have your view as to how atomics should work, and it's an interesting
    perspective, but not the only one.

    If you had access to a system where std::mutex was faster and handled
    contention better than DCAS, CAS, and LOCK XADD in all cases (and faster >> than the best implementation on any system), would you use mutexes rather >> than lock-free algorithms?

    Hummm...

    std::mutex is not lock-free by default.

    And more importantly, a LOCK XADD is fundamentally different from using
    a std::mutex to emulate one. They are not two implementations of the
    same thing.

    Lock-free algorithms solve two problems of mutexes: they are immune to a thread being reschedules while holding a lock; and they are a response
    to poor performance of locks on many systems. Remote-memory atomics
    (doing the atomic in the fabric or at the memory controller) are a
    response to the poor contention performance of locks or atomics (on lock-free or locks) being done by the CPU on many platforms.

    But if we fix those two problems, and make locks fast, then there's much
    less need for lock-free algorithms.

    Lock-free also solves the thread preemption issue--a thread that wins a mutex and then gets interrupted blocks the entire application that relies on
    that mutex since that thread is no longer running, so if the platforms gave some guarantees about that not occurring, then any lock-free algorithm
    can pretty easily be changed to use a non-preemptible mutex instead.

    The LL/SC sequence on My 66000 implementations is guaranteed to fail
    when an interrupt (or exception) causes a transfer of control out of
    the thread, not while running the critical section, but while acquiring
    or releasing the mutex.


    LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.

    A std::mutex wrapping an increment is a completely different beast. Even >in the uncontended fast path you're doing an atomic RMW on the lock
    word; a memory barrier; _critical section_; then another membar and
    atomic RMW on unlock. That's already more work than LOCK XADD before >you've even touched the data. Under contention it gets much worse — >threads block, context switches happen, the OS scheduler gets involved.

    So the hypothetical of 'what if std::mutex was faster' is interesting
    but it's asking the wrong question. The question isn't speed in
    isolation — it's what guarantees and costs come with the primitive. LOCK
    XADD gives you atomicity with no blocking, no kernel, no priority >inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.

    You may want to re-read my earlier post. I just proposed a way to make
    mutex faster than normal processors do even the most basic atomic
    sequences.

    Do you agree that an ISA cannot make an atomic-acquire or atomic-release
    any faster than a core can perform the same instruction sequence without
    any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

    I didn't describe this in that post, but if you add in a CPU which supports sequential ordering with the mutex Transaction Prediction logic, then that CPU can speculate and execute multiple locks at once
    easily (nested or serial), which is something weakly-ordered CPUs just
    cannot do since the barriers prevent starting new requests until
    ordering is guaranteed.

    The LL/SC semantics on My 66000 directly supports using as many as
    8 cache lines in a single ATOMIC event with the property that all
    3rd party observers only see the before and after states of all
    cache lines. From the beginning of the first LL to the end with an
    SC, all such participating memory modifications appear atomic.

    Thus, there was no need for the concept of nesting nor serialization.
    One does not have to get 1 mutex and then get another--you can get both
    at the same time (should you need that).

    This brings up the notion of the difference between an ATOMIC-event
    and a critical section. An ATOMIC-event is a unit of work that all
    interested 3rd parties can all agree on "order" {Joe got it Sally
    did not}. All of this can be handled by addresses, without consideration
    of any data value (address values yes, data values, no). Acquire-mutex
    and release-mutex are ATOMIC based on address. While the rest of the
    critical section is primarily based on data (and pointers used as data).

    There are also contexts where you literally cannot use a mutex — signal
    handlers, certain real-time contexts, interrupt service routines. >Lock-free is not just a performance choice, it's sometimes the only choice.

    If you have non-interruptable user mutexes, then it can work in these
    places. I did not explain how to do this, but is a pretty obvious
    add-on, but likely requires a new hint bit on the atomic instructions.

    It also requires a control transfer point so that if an interrupt is
    not able to be deferred, that the mutex guarded critical section has
    a way to KNOW that the un-interruptible-event was, indeed, interrupted.
    My 66000 has this built into its pipelined version of LL/SC.

    A non-interruptable mutex ends up looking a bit like an explicit
    Transaction, but it has different properties and a very different HW implementation.

    I like to say, My 66000 ATOMICs are not a form of TM but can be used to
    make a SW version of TM have the appropriate semantics guarantees.

    Basically, the atomic operation needs a hint that this
    is a non-interruptable mutex, so the hardware knows how to release the
    lock if an exception occurs (such as a TLB miss). A compare-and-swap
    getting the mutex is different than a compare-and-swap doing a lock free algorithm, so we need a direct indication from the code that it's a
    mutex to get this right. Note that not getting the hint can still
    let HW make the mutex fast, it just isn't non-interruptable.

    What's nice about a mutex, which is basically a user-land spinlock, is
    it's about easiest for software (it's almost always the first described
    to new coders since it's easy to understand), AND the easiest for
    hardware to accelerate. The critical section is all in one place.

    From the HW point of view (and the programming model of My 66000),
    There are 3 noted events:
    a) acquisition of the mutex(s)
    b) transit of the critical section
    c) release of the mutex(s).

    The problem with old style ATOMICs is that while (a) is easily visible
    as ATOMIC, (c) is not (although a predictor would likely work quite
    well.

    The problem with CAS style ATOMICs is that the initial LD(s) are not
    seen as ATOMIC until the CAS is performed, opening a window for ABA.

    The problem with LL/SC style ATOMICs is that while the whole sequence
    is known to be a critical section, spurious failures are allowed in
    most implementations.

    It is possible with My 66000 ATOMICs to perform the whole critical section WITHIN a single ATOMIC-event.

    Lock-free algorithms make it much harder for hardware to see what is the
    real action that needs to occur. Inserting a new head pointer to a
    linked list requires looking at the current head pointer, setting my new item's next-> to point to that, then doing Compare-and-Swap on the head pointer with our item pointer.

    Try that on a doubly linked list. That is where DCADS comes in.

    It's hard to have hardware detect the
    prep work (but not impossible--in this particular case, we can see a
    load to the head pointer, then the write to the next-> field, and then the CAS to the head pointer, and can try to make that whole sequence atomic,
    but this is just a trivial lock-free case).

    Why thinking about this this afternoon, I am left wondering if the page
    tables could be augmented with a bit indicating that ATOMIC stuff is going
    on on this page ??

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 02:20:34 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/9/2026 6:33 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:
    -------------------------

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong
    does not allow that to occur. So, it can be used for interesting state
    machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly.
    Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
    LL/SC is working on a reservation granule which can be different than an >> atomic RMW using L2 cache lines.

    If you want the strong CAS it is written::

    LDD Rd,[address]
    ...
    ...
    LL R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    If you want weak CAS it is written:

    LL Rd,[address]
    ...
    ...
    LD R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    It just changes the length of time interference is being observed.

    Pretty interesting! So, we can have spurious failure on the weak CAS and none on the strong CAS?

    It is not zero, but it is exponentially fewer; and no architecture is
    going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

    strong CAS failure can be used in a state machine, weak CAS failure cannot...
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 12 02:43:22 2026
    From Newsgroup: comp.arch

    On 6/11/2026 2:30 PM, Kent Dickey wrote:

    Will get back to you but...

    What's nice about a mutex, which is basically a user-land spinlock

    NO! a mutex is not a spinlock. A mutex can allow a thread to wait in the kernel. A spinlock is way different. Now there are adpative mutex that
    can spin a bit before they hit the slow path, kernel wait.

    Fwiw, I have fun way to use mutexes using try_lock, but that is another
    story.


    Also, relaxed arch with competent programmers can make things fly. Btw,
    are you familiar with RCU?



    _____

    This is using the idea of trying to do other work while a mutex is
    contended. Why spin or wait when we don't really have to? The per-thread random numbers in this experiment are only there to try to simulate when
    a contended thread has any other work to do. This can be check
    per-thread stacks, queues, ect... Here is my example code. Can you get
    it to run, and can you show me the output? Now, this can work in real
    world environments. It's a stare to an adaptive mutex that just spins
    doing nothing... lol... ;^)

    I compiled it using c++20

    (read all...)
    ____________________________________
    // A Fun Mutex Pattern? Or, a Nightmare? Humm...
    // By: Chris M. Thomasson
    //___________________________________________________


    #include <iostream>
    #include <random>
    #include <numeric>
    #include <algorithm>
    #include <thread>
    #include <atomic>
    #include <mutex>


    #define CT_WORKERS (42)
    #define CT_ITERS (996699)
    #define CT_BACKOFFS (42)
    #define CT_RAND_MAX (20)
    #define CT_RAND_THRESHOLD (5)


    struct ct_shared
    {
    std::mutex m_fun_mutex;
    std::atomic<unsigned long> m_other_work = { 0 };
    int m_test_count0 = 0;

    void
    sanity_check_dump() const
    {
    std::cout << "(ct_shared:" << this << ")->" <<
    "m_test_count0 = " << m_test_count0 << ", " <<
    "m_other_work = " << m_other_work.load(std::memory_order_relaxed) << "\n";
    }

    bool
    sanity_check_validate() const
    {
    return (m_test_count0 == CT_ITERS * CT_WORKERS);
    }
    };



    void
    ct_worker_entry(
    ct_shared& shared
    ) {
    //std::cout << "ct_worker_entry" << std::endl; // testing thread
    race for sure...

    // Thread Local...
    std::random_device rnd_seed = { };
    std::mt19937 rnd_gen(rnd_seed());
    std::uniform_int_distribution<unsigned long> rnd_dist(0, CT_RAND_MAX);

    for (unsigned long i = 0; i < CT_ITERS; ++i)
    {
    // Lock logic...
    {
    unsigned long backoff = 0;

    while (! shared.m_fun_mutex.try_lock())
    {
    unsigned long rnd0 = rnd_dist(rnd_gen);

    if (rnd0 > CT_RAND_THRESHOLD || backoff > CT_BACKOFFS)
    {
    shared.m_fun_mutex.lock();
    break;
    }

    // do other work... :^)
    shared.m_other_work.fetch_add(1,
    std::memory_order_relaxed);

    // but not too much work... ;^o
    ++backoff;
    }
    }

    // Critical Section...
    {
    shared.m_test_count0 = shared.m_test_count0 + 1;
    }

    // Unlock
    {
    shared.m_fun_mutex.unlock();
    }
    }
    }


    int main()
    {
    // Hello... :^)
    {
    std::cout << "Hello ct_fun_mutex... lol? ;^) ver:(0.0.0)\n";
    std::cout << "By: Chris M. Thomasson\n";
    std::cout << "____________________________________________________\n";
    std::cout.flush();
    }

    // Create our fun things... ;^)
    ct_shared shared = { };
    std::thread workers[CT_WORKERS] = { };

    // Lanuch...
    {
    std::cout << "Launching Threads...\n";
    std::cout.flush();

    for (unsigned long i = 0; i < CT_WORKERS; ++i)
    {
    workers[i] = std::thread(ct_worker_entry, std::ref(shared));
    }
    }

    // Join...
    {
    std::cout << "Joining Threads... (computing :^)\n";
    std::cout.flush();
    for (unsigned long i = 0; i < CT_WORKERS; ++i)
    {
    workers[i].join();
    }
    }

    // Sanity Check...
    {
    shared.sanity_check_dump();

    if (! shared.sanity_check_validate())
    {
    std::cout << "\n\n**** Pardon my French, but FUCK!!!!!
    ****\n" << std::endl;
    }

    else
    {
    std::cout << "\nWe are Sane!\n\n";
    std::cout << "We completed " <<
    shared.m_other_work.load(std::memory_order_relaxed) <<
    " work items while waiting for the mutex..." << std::endl;
    }
    }

    // Fin...
    {
    std::cout << "____________________________________________________\n";
    std::cout << "Fin... :^)\n" << std::endl;
    }

    return 0;
    }
    ____________________________________

    Any luck? Its fun to see how many work items were completed when the
    mutex was contended...

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Fri Jun 12 15:15:20 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:
    In article <110f075$1lt51$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/10/2026 7:58 PM, Kent Dickey wrote:
    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    [snip]
    Lock-free also solves the thread preemption issue--a thread that wins a mutex
    and then gets interrupted blocks the entire application that relies on
    that mutex since that thread is no longer running, so if the platforms gave >> some guarantees about that not occurring, then any lock-free algorithm
    can pretty easily be changed to use a non-preemptible mutex instead.

    The LL/SC sequence on My 66000 implementations is guaranteed to fail
    when an interrupt (or exception) causes a transfer of control out of
    the thread, not while running the critical section, but while acquiring
    or releasing the mutex.


    LOCK XADD is a single atomic instruction, one hardware transaction, no
    kernel involvement, no thread blocking, no scheduler interaction. Done.

    A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
    word; a memory barrier; _critical section_; then another membar and
    atomic RMW on unlock. That's already more work than LOCK XADD before
    you've even touched the data. Under contention it gets much worse — >> >threads block, context switches happen, the OS scheduler gets involved.

    So the hypothetical of 'what if std::mutex was faster' is interesting
    but it's asking the wrong question. The question isn't speed in
    isolation it's what guarantees and costs come with the
    primitive. LOCK
    XADD gives you atomicity with no blocking, no kernel, no priority
    inversion, no deadlock risk. A mutex gives you none of those properties
    regardless of how fast it is.

    You may want to re-read my earlier post. I just proposed a way to make
    mutex faster than normal processors do even the most basic atomic
    sequences.

    Do you agree that an ISA cannot make an atomic-acquire or atomic-release
    any faster than a core can perform the same instruction sequence without
    any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

    The obvious answer is no, and that is true for an uncontended atomic instruction. However, once contention comes in to play, anything that
    reduces the contention storm can be a big win. It's not clear how to
    reduce that storm when it's caused by ordinary LD/ST, but I see lots of
    ways to reduce it much of the time if HW can detect the operation's
    boundaries.

    I didn't describe this in that post, but if you add in a CPU
    which supports sequential ordering with the mutex Transaction Prediction
    logic, then that CPU can speculate and execute multiple locks at once
    easily (nested or serial), which is something weakly-ordered CPUs just
    cannot do since the barriers prevent starting new requests until
    ordering is guaranteed.

    The LL/SC semantics on My 66000 directly supports using as many as
    8 cache lines in a single ATOMIC event with the property that all
    3rd party observers only see the before and after states of all
    cache lines. From the beginning of the first LL to the end with an
    SC, all such participating memory modifications appear atomic.

    Thus, there was no need for the concept of nesting nor serialization.
    One does not have to get 1 mutex and then get another--you can get both
    at the same time (should you need that).

    This brings up the notion of the difference between an ATOMIC-event
    and a critical section. An ATOMIC-event is a unit of work that all
    interested 3rd parties can all agree on "order" {Joe got it Sally
    did not}. All of this can be handled by addresses, without consideration
    of any data value (address values yes, data values, no). Acquire-mutex
    and release-mutex are ATOMIC based on address. While the rest of the
    critical section is primarily based on data (and pointers used as data).

    Transactional Memory from 15 years ago was going to fundamentally change
    the way multi-threaded apps worked. And then it fizzled out. Because
    the goal was laudable: remove locks and atomic altogether, and make
    adding a record to a database be just:

    START_TXN();
    add_record_to_database();
    END_TXN();

    and just do add_record_to_database() effectively atomically. The problem
    is that edge cases mean sometimes add_record_to_database() can be 5-10 instructions, and sometimes it's thousands. And Transactional Memory
    is very complex for hardware to get right, and it's hard for software to
    get right too (since software cannot let there be thousands of instructions inside the TXN, so it needs to be rewritten).

    Basically, there are too many unsolved complex problems. Some of which
    apply to your ESM--needing software changes is a big problem. And so far
    no one has shown how to exhaustively validate this kind of mechanism.
    Because there's no real fallback: it has to work, it's not a feature that
    can just be "turned off".

    There are also contexts where you literally cannot use a mutex — signal
    handlers, certain real-time contexts, interrupt service routines.
    Lock-free is not just a performance choice, it's sometimes the only choice. >>
    If you have non-interruptable user mutexes, then it can work in these
    places. I did not explain how to do this, but is a pretty obvious
    add-on, but likely requires a new hint bit on the atomic instructions.

    It also requires a control transfer point so that if an interrupt is
    not able to be deferred, that the mutex guarded critical section has
    a way to KNOW that the un-interruptible-event was, indeed, interrupted.
    My 66000 has this built into its pipelined version of LL/SC.

    A non-interruptable mutex ends up looking a bit like an explicit
    Transaction, but it has different properties and a very different HW
    implementation.

    I like to say, My 66000 ATOMICs are not a form of TM but can be used to
    make a SW version of TM have the appropriate semantics guarantees.

    I've had more time to think about my vague description on how to make an non-interruptible mutex, and now I think it's not a good idea. I think
    trying to hold off actual interrupts is still a good idea, but not to have
    HW try to back out of the mutex operation itself. It's just too much like Transactional Memory--rather than being an optional acceleration which can
    be turned off a problem is found, non-interruptible mutex must always be correct and must provide guarantees, and so it's wandered into the minefield
    of Transactional Memory, and it's best to stay out of there.

    Basically, the atomic operation needs a hint that this
    is a non-interruptable mutex, so the hardware knows how to release the
    lock if an exception occurs (such as a TLB miss). A compare-and-swap
    getting the mutex is different than a compare-and-swap doing a lock free
    algorithm, so we need a direct indication from the code that it's a
    mutex to get this right. Note that not getting the hint can still
    let HW make the mutex fast, it just isn't non-interruptable.

    What's nice about a mutex, which is basically a user-land spinlock, is
    it's about easiest for software (it's almost always the first described
    to new coders since it's easy to understand), AND the easiest for
    hardware to accelerate. The critical section is all in one place.

    From the HW point of view (and the programming model of My 66000),
    There are 3 noted events:
    a) acquisition of the mutex(s)
    b) transit of the critical section
    c) release of the mutex(s).

    The problem with old style ATOMICs is that while (a) is easily visible
    as ATOMIC, (c) is not (although a predictor would likely work quite
    well.

    Yes, I agree. But note the release of the mutex is to the same address
    as the atomic instruction, so it can be detected after it's been seen.

    The problem with CAS style ATOMICs is that the initial LD(s) are not
    seen as ATOMIC until the CAS is performed, opening a window for ABA.

    Agreed, and it's hard to pick out the preparation work being done for
    these lock-free algorithms.

    The problem with LL/SC style ATOMICs is that while the whole sequence
    is known to be a critical section, spurious failures are allowed in
    most implementations.

    I am not a fan of LL/SC. It's possible to get it right in HW, but most designers underestimate the invasiveness of LL/SC--the entire load/store
    unit needs little modifications everywhere to make LL/SC work efficiently
    and reliably. You want to implement transient memory or writes which
    bypass the cache, or you speculatively execute instructions past the
    end of the LL/SC loop--make sure you solve how they interact with LL/SC.
    I've yet to meet any design which got it "right" the first try (but once
    a design team learns, then they get it right, until another new system
    feature pops up and breaks it).

    LL/SC is the academic solution to building "let's allow anything" atomics.
    "We can let the user build the atomic operations are their dreams, and
    we don't have to solidify it in HW, we just let them write code."
    It's like a preliminary Transactional Memory--we can let the users do a
    lot with us just needing to do a little bit in hardware.

    And then when LL/SC gets implemented, you basically end up with
    Fetch-and-Add, Swap, and Compare-and-Swap (and other trivial ops, like Fetch-and-AND, etc.) only. The RISC-V restrictions are pretty intense:
    You can do LR; you can then do a handful of integer instructions (no FP,
    no vector) not including any ld/st or system instructions, nor any
    backward branches, no function calls; and then SC, then you get the same restrictions but now you get a backwards branch. What exactly are you
    able to implement with these restrictions other than the usual atomic operations that now most everyone puts in HW? Note on RISC-V, if an implementation has LR/SC, then it must ALSO have the atomic instructions Fetch-and-Add, Swap, Compare-and-Swap, and the other simple ones. So why bother with LR/SC?

    I understand where RISC-V's LL/SC restrictions are coming from. I get it. Which is why I don't like LL/SC.


    It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.

    Lock-free algorithms make it much harder for hardware to see what is the
    real action that needs to occur. Inserting a new head pointer to a
    linked list requires looking at the current head pointer, setting my new
    item's next-> to point to that, then doing Compare-and-Swap on the head
    pointer with our item pointer.

    Try that on a doubly linked list. That is where DCADS comes in.

    When I say Compare-and-Swap, I implicitly include DCAS. DCADS is more
    exotic, so if I mean that, I'd explicitly list it.

    It's hard to have hardware detect the
    prep work (but not impossible--in this particular case, we can see a
    load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic,
    but this is just a trivial lock-free case).

    Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??

    I'm just not sure that provides much help.

    To me, the key insight which I wanted to get across was that with a mutex
    (and spinlock), it's a huge advantage to NOT let other CPUs ever see the
    mutex lock location in the locked state. It solves the contention storm,
    and yet it's "optional" meaning if hardware has to "give up", it's still correct (just lower performance). And that lock-free algorithms would
    get the same benefit IF IT CAN BE DETECTED, but that it's much harder
    for HW to detect (since the lock-free set of operations is potentially
    much larger and more complex).

    And that with HW support to solve the contention storm, a mutex (and
    a spinlock, which is inside a mutex), which is easy for software to use,
    could become faster than lock-free algorithms (when spinlocks or mutexes
    make sense, since I'm abandoning the idea of making them work in interrupt handlers, etc.).

    For Linux spinlocks, if my memory of this is still right, the Linux
    kernel code often needs to mask interrupts, so I suggest making masking interrupts an extremely lightweight operation. When you see an
    instruction which seems like it would mask interrupts in Fetch/Decode, immediately set a bit to block interrupts. Once the instruction
    executes (or is discarded), clear that bit (maybe make it a 4-bit
    counter, so you can speculate further). Don't serialize the instruction
    that masks off interrupts.

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 17:11:29 2026
    From Newsgroup: comp.arch


    [email protected] (Kent Dickey) posted:

    ERROR "unexpected byte sequence starting at index 1569: '\xC3'" while decoding:

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:
    In article <110f075$1lt51$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    On 6/10/2026 7:58 PM, Kent Dickey wrote:
    In article <110a0ml$ao8m$[email protected]>,
    Chris M. Thomasson <[email protected]> wrote:
    [snip]
    Lock-free also solves the thread preemption issue--a thread that wins a mutex
    and then gets interrupted blocks the entire application that relies on
    that mutex since that thread is no longer running, so if the platforms gave
    some guarantees about that not occurring, then any lock-free algorithm
    can pretty easily be changed to use a non-preemptible mutex instead.

    The LL/SC sequence on My 66000 implementations is guaranteed to fail
    when an interrupt (or exception) causes a transfer of control out of
    the thread, not while running the critical section, but while acquiring
    or releasing the mutex.


    LOCK XADD is a single atomic instruction, one hardware transaction, no >> >kernel involvement, no thread blocking, no scheduler interaction. Done. >> >
    A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
    word; a memory barrier; _critical section_; then another membar and
    atomic RMW on unlock. That's already more work than LOCK XADD before
    you've even touched the data. Under contention it gets much worse —
    threads block, context switches happen, the OS scheduler gets involved. >> >
    So the hypothetical of 'what if std::mutex was faster' is interesting
    but it's asking the wrong question. The question isn't speed in
    isolation it's what guarantees and costs come with the
    primitive. LOCK
    XADD gives you atomicity with no blocking, no kernel, no priority
    inversion, no deadlock risk. A mutex gives you none of those properties >> >regardless of how fast it is.

    You may want to re-read my earlier post. I just proposed a way to make
    mutex faster than normal processors do even the most basic atomic
    sequences.

    Do you agree that an ISA cannot make an atomic-acquire or atomic-release >any faster than a core can perform the same instruction sequence without >any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

    The obvious answer is no, and that is true for an uncontended atomic instruction. However, once contention comes in to play, anything that reduces the contention storm can be a big win.

    The main motivation for allowing a single ATOMIC-event to perform data manipulation on more than one memory location is to reduce the total
    number of ATOMIC-events--and thereby reducing the amount of contention.

    It's not clear how to
    reduce that storm when it's caused by ordinary LD/ST, but I see lots of
    ways to reduce it much of the time if HW can detect the operation's boundaries.

    When I designed ASF at AMD (2005), it was clear that one of the problems
    of "regular" synchronization was that the release* of the mutex was not
    marked in some way as ATOMIC--it was a ST #0 (or equivalent). SC converts
    that ST into something that is easily recognized as ATOMIC.

    (*) test_and_set was::

    TS Rd,[address]
    BNE0 label
    ...
    ... // critical section work
    ...
    ST #0,[address] // takes address comparator to see
    // this is part of an ATOMIC-event


    I didn't describe this in that post, but if you add in a CPU >> which supports sequential ordering with the mutex Transaction Prediction >> logic, then that CPU can speculate and execute multiple locks at once
    easily (nested or serial), which is something weakly-ordered CPUs just
    cannot do since the barriers prevent starting new requests until
    ordering is guaranteed.

    The LL/SC semantics on My 66000 directly supports using as many as
    8 cache lines in a single ATOMIC event with the property that all
    3rd party observers only see the before and after states of all
    cache lines. From the beginning of the first LL to the end with an
    SC, all such participating memory modifications appear atomic.

    Thus, there was no need for the concept of nesting nor serialization.
    One does not have to get 1 mutex and then get another--you can get both
    at the same time (should you need that).

    This brings up the notion of the difference between an ATOMIC-event
    and a critical section. An ATOMIC-event is a unit of work that all >interested 3rd parties can all agree on "order" {Joe got it Sally
    did not}. All of this can be handled by addresses, without consideration
    of any data value (address values yes, data values, no). Acquire-mutex
    and release-mutex are ATOMIC based on address. While the rest of the >critical section is primarily based on data (and pointers used as data).

    Transactional Memory from 15 years ago was going to fundamentally change
    the way multi-threaded apps worked. And then it fizzled out. Because
    the goal was laudable: remove locks and atomic altogether, and make
    adding a record to a database be just:

    START_TXN();
    add_record_to_database();
    END_TXN();

    and just do add_record_to_database() effectively atomically. The problem
    is that edge cases mean sometimes add_record_to_database() can be 5-10 instructions, and sometimes it's thousands. And Transactional Memory
    is very complex for hardware to get right, and it's hard for software to
    get right too (since software cannot let there be thousands of instructions inside the TXN, so it needs to be rewritten).

    The TM model was fundamentally flawed--SW was assuming the amount of data manipulation in a single transaction was unbounded:: HW does not do unbounded well.

    Basically, there are too many unsolved complex problems. Some of which
    apply to your ESM--needing software changes is a big problem. And so far
    no one has shown how to exhaustively validate this kind of mechanism.
    Because there's no real fallback: it has to work, it's not a feature that
    can just be "turned off".

    All of the now current atomic::stuff can be directly implemented in ESM.
    It has to be that way for any chance of success !! However, in order for
    SW to address the exponent of interference, the std::atomic::stuff needs
    some additional functionality. ESM has the ability to allow all the
    requisite SW work/experimentation to be performed without HW additions
    (as long as the ATOMIC-event does not need more than 8 cache lines of
    data to be manipulated in a single event.)

    There are also contexts where you literally cannot use a mutex — signal
    handlers, certain real-time contexts, interrupt service routines.
    Lock-free is not just a performance choice, it's sometimes the only choice.

    If you have non-interruptable user mutexes, then it can work in these
    places. I did not explain how to do this, but is a pretty obvious
    add-on, but likely requires a new hint bit on the atomic instructions.

    It also requires a control transfer point so that if an interrupt is
    not able to be deferred, that the mutex guarded critical section has
    a way to KNOW that the un-interruptible-event was, indeed, interrupted.
    My 66000 has this built into its pipelined version of LL/SC.

    A non-interruptable mutex ends up looking a bit like an explicit
    Transaction, but it has different properties and a very different HW
    implementation.

    I like to say, My 66000 ATOMICs are not a form of TM but can be used to >make a SW version of TM have the appropriate semantics guarantees.

    I've had more time to think about my vague description on how to make an non-interruptible mutex, and now I think it's not a good idea. I think trying to hold off actual interrupts is still a good idea, but not to have
    HW try to back out of the mutex operation itself.

    Agreed, a mutex is a SW model, while an ATOMIC-event is a HW model.

    It's just too much like Transactional Memory--rather than being an optional acceleration which can
    be turned off a problem is found, non-interruptible mutex must always be correct and must provide guarantees, and so it's wandered into the minefield of Transactional Memory, and it's best to stay out of there.

    I saw this in 2004...

    Basically, the atomic operation needs a hint that this
    is a non-interruptable mutex, so the hardware knows how to release the
    lock if an exception occurs (such as a TLB miss). A compare-and-swap
    getting the mutex is different than a compare-and-swap doing a lock free >> algorithm, so we need a direct indication from the code that it's a
    mutex to get this right. Note that not getting the hint can still
    let HW make the mutex fast, it just isn't non-interruptable.

    What's nice about a mutex, which is basically a user-land spinlock, is
    it's about easiest for software (it's almost always the first described
    to new coders since it's easy to understand), AND the easiest for
    hardware to accelerate. The critical section is all in one place.

    From the HW point of view (and the programming model of My 66000),
    There are 3 noted events:
    a) acquisition of the mutex(s)
    b) transit of the critical section
    c) release of the mutex(s).

    The problem with old style ATOMICs is that while (a) is easily visible
    as ATOMIC, (c) is not (although a predictor would likely work quite
    well.

    Yes, I agree. But note the release of the mutex is to the same address
    as the atomic instruction, so it can be detected after it's been seen.

    Way back when ASF was being born; another address comparator might
    have reasonable arguments against it. Now not so much. In any event,
    one does not want this address comparator to add to thread.state and
    impact ISR times; that is why ASF and ESM adopted a model whereby
    any ATOMIC-event was guaranteed to fail if an exception/interrupt
    transpired. This also gets rid of excessive thread.state at context
    switch time.

    The problem with CAS style ATOMICs is that the initial LD(s) are not
    seen as ATOMIC until the CAS is performed, opening a window for ABA.

    Agreed, and it's hard to pick out the preparation work being done for
    these lock-free algorithms.

    The problem with LL/SC style ATOMICs is that while the whole sequence
    is known to be a critical section, spurious failures are allowed in
    most implementations.

    I am not a fan of LL/SC. It's possible to get it right in HW, but most designers underestimate the invasiveness of LL/SC--the entire load/store
    unit needs little modifications everywhere to make LL/SC work efficiently
    and reliably. You want to implement transient memory or writes which
    bypass the cache, or you speculatively execute instructions past the
    end of the LL/SC loop--make sure you solve how they interact with LL/SC.
    I've yet to meet any design which got it "right" the first try (but once
    a design team learns, then they get it right, until another new system feature pops up and breaks it).

    LL/SC is the academic solution to building "let's allow anything" atomics.

    The academic ?solutions? fail to have the property that some kind of
    automatic flow-control is performed when it becomes known that the
    event will fail. ASF had a primitive version, ESM was a well worked out version.

    If one simply codes ESM to perform Test_and_set, HW conspires to actually implement Test_and_test_and_set--due to the way the above paragraph is performed in HW.

    "We can let the user build the atomic operations are their dreams, and
    we don't have to solidify it in HW, we just let them write code."

    This is ripe for undefined HW behavior--or worse, slightly varying
    behavior as time goes on.

    It's like a preliminary Transactional Memory--we can let the users do a
    lot with us just needing to do a little bit in hardware.

    See how well that worked out ?!? not ...

    And then when LL/SC gets implemented, you basically end up with Fetch-and-Add, Swap, and Compare-and-Swap (and other trivial ops, like Fetch-and-AND, etc.) only. The RISC-V restrictions are pretty intense:
    You can do LR; you can then do a handful of integer instructions (no FP,
    no vector) not including any ld/st or system instructions, nor any
    backward branches, no function calls; and then SC, then you get the same restrictions but now you get a backwards branch.

    My 66000 has none of these restrictions, BTW. The only caveat is that
    as the ATOMIC-event grows larger, the window for interference also
    grows larger and failures happen more often.

    What exactly are you
    able to implement with these restrictions other than the usual atomic operations that now most everyone puts in HW?

    DCADS, TCADS, TCATS, ...

    In additions, one can "do the ATOMIC-thing and also store a marker
    so debugger can figure out WHO did this to the concurrent data
    structure; One can add a time figure to the CDS, ... there are
    a lot of things one can do to make these things more debuggable,
    robust, {RAS arguments} by increasing the amount of data that
    can be performed in a single ATOMIC event.

    And then there is the reduction in the number of ATOMIC-events,
    which directly addresses interference.

    Note on RISC-V, if an implementation has LR/SC, then it must ALSO have the atomic instructions Fetch-and-Add, Swap, Compare-and-Swap, and the other simple ones. So why bother with LR/SC?

    This is what happens when one approaches ATOMIC-stuff with an academic mentality.

    I understand where RISC-V's LL/SC restrictions are coming from. I get it. Which is why I don't like LL/SC.

    In My 66000, when/if control arrives at the failure-point, one can access
    a control register that contains WHY the event failed. ) means success, negative means spurious, while positive is a direct measure of the amount
    of interference you encountered. The last one can be used in a subsequent ATOMIC-event proactively to go after a unit of work that no other thread
    is going after, and thereby eliminate interference. It is just that SW has
    to be "in that game", HW cannot do it by itself. No SW is currently written like this because no HW provides that kind of functionality.

    Chicken and Egg.


    It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.

    Lock-free algorithms make it much harder for hardware to see what is the >> real action that needs to occur. Inserting a new head pointer to a
    linked list requires looking at the current head pointer, setting my new >> item's next-> to point to that, then doing Compare-and-Swap on the head
    pointer with our item pointer.

    Try that on a doubly linked list. That is where DCADS comes in.

    When I say Compare-and-Swap, I implicitly include DCAS. DCADS is more exotic, so if I mean that, I'd explicitly list it.

    It's hard to have hardware detect the
    prep work (but not impossible--in this particular case, we can see a
    load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic, >> but this is just a trivial lock-free case).

    Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??

    I'm just not sure that provides much help.

    To me, the key insight which I wanted to get across was that with a mutex (and spinlock), it's a huge advantage to NOT let other CPUs ever see the mutex lock location in the locked state. It solves the contention storm,
    and yet it's "optional" meaning if hardware has to "give up", it's still correct (just lower performance).

    Consider that thread.A gets the mutex, then a high priority interrupt
    needs and ISR, and a context switch to a higher priority task. There
    is NO WAY for HW not to let everyone see the mutex in its locked state.
    The problem is that the mutex gets is state by having a datum modified !

    Whereas: in HW one can perform the near-mutex based solely on the address without any data being modified--since the address is subject to SNOOPs.
    This puts HW in a position where it can do a control-transfer before
    taking an exception/interrupt--thereby backing out of the near-mutex event. This is a key difference between MY 66000 ESM (and AMD ASF) and academic
    LL/SC. The amount of ATOMICally-modifyable data is another.

    And that lock-free algorithms would
    get the same benefit IF IT CAN BE DETECTED, but that it's much harder
    for HW to detect (since the lock-free set of operations is potentially
    much larger and more complex).

    And that with HW support to solve the contention storm, a mutex (and
    a spinlock, which is inside a mutex), which is easy for software to use, could become faster than lock-free algorithms (when spinlocks or mutexes
    make sense, since I'm abandoning the idea of making them work in interrupt handlers, etc.).

    2 Points::

    1) ESM can be used in ISRs.
    2) My 66000 has a very fast way to perform DPC/softIRQs based on its
    interrupt table model obviating the need for ISRs to do ATOMIC-stuff.

    For Linux spinlocks, if my memory of this is still right, the Linux
    kernel code often needs to mask interrupts,

    Or transition to a priority level that no interrupt can raise.

    so I suggest making masking interrupts an extremely lightweight operation. When you see an
    instruction which seems like it would mask interrupts in Fetch/Decode, immediately set a bit to block interrupts. Once the instruction
    executes (or is discarded), clear that bit (maybe make it a 4-bit
    counter, so you can speculate further). Don't serialize the instruction
    that masks off interrupts.

    Most of what you say above is unnecessary in My 66000 ISRs.
    a) the priority of the core is automagically updated with the priority
    of the interrupt.
    b) the thread.state of the interrupted thread is returned to memory
    c) the thread.state of the interrupting thread is fetched from memory
    d) the fetch of thread.state can start before the store of thread.state
    e) MSI-X interrupt message is already in R0.

    So, by the time control arrives, ISR has a complete thread.state, including
    the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and near-40-bits from the interrupt so it can figure out what to do based on control arriving here. ISR can use any instruction including ESM. ISR
    can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single instruction). When ISR performs SVR, SVR check pending threads in interrupt table, and transfers control to any pending thread higher than interrupted thread (SW does not have to check).

    Interrupts do not need to be masked, as all ISRs of lesser priority have already been blocked, and the thread is already in a re-entrant state--
    so, no quirky instruction sequences are necessary. ISRs can be compiled
    with no compiler-knowledge of it being an ISR (and no ASM instructions).


    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Jun 14 18:57:47 2026
    From Newsgroup: comp.arch


    MitchAlsup <[email protected]d> posted:

    Consider "Dining Philosophers" My 66000 style::

    struct { uint64_t left, right,
    time, wait, starve;
    } philosopher;

    Task Philosopher()
    {
    Local philosopher *p;

    do{
    if( get_forks( p )
    {
    eat();
    put_forks( p );
    }
    else
    {
    philosophise( p ); // wait p->time
    p->wait += p->time;
    if( p->wait > p->starve );
    dies( p ); // no return from here
    }
    } while( TRUE );
    }

    Where::

    get_fork:
    LDDL R3,[R1,#left] // start of event
    LDDL R4,[R1,#right]
    OR R5,R4,R3
    BEQ R5,exit
    STD #1,[R1,#left]
    STDL #1,[R1,#right] // end of event
    MOV R1,#1
    RET
    exit:
    INVAL #EVENT // terminate event without writing anything
    MOV R1,#0
    RET

    put_forks:
    STD #0,[R1,#left]
    STD #0,[R1,#right]
    STD #0,[R1,#wait]
    RET
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sun Jun 14 13:45:05 2026
    From Newsgroup: comp.arch

    On 6/11/2026 7:20 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/9/2026 6:33 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:
    -------------------------

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong
    does not allow that to occur. So, it can be used for interesting state >>>> machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.

    If you want the strong CAS it is written::

    LDD Rd,[address]
    ...
    ...
    LL R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    If you want weak CAS it is written:

    LL Rd,[address]
    ...
    ...
    LD R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    It just changes the length of time interference is being observed.

    Pretty interesting! So, we can have spurious failure on the weak CAS and
    none on the strong CAS?

    It is not zero, but it is exponentially fewer; and no architecture is
    going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

    Just to clarify, strong CAS should never spuriously fail. Right?



    strong CAS failure can be used in a state machine, weak CAS failure
    cannot...

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Sun Jun 14 21:14:03 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:
    [snip]

    I'm just going to follow up on Linux kernel spinlocks.

    For Linux spinlocks, if my memory of this is still right, the Linux
    kernel code often needs to mask interrupts,

    Or transition to a priority level that no interrupt can raise.

    so I suggest making masking
    interrupts an extremely lightweight operation. When you see an
    instruction which seems like it would mask interrupts in Fetch/Decode,
    immediately set a bit to block interrupts. Once the instruction
    executes (or is discarded), clear that bit (maybe make it a 4-bit
    counter, so you can speculate further). Don't serialize the instruction
    that masks off interrupts.

    Most of what you say above is unnecessary in My 66000 ISRs.
    a) the priority of the core is automagically updated with the priority
    of the interrupt.
    b) the thread.state of the interrupted thread is returned to memory
    c) the thread.state of the interrupting thread is fetched from memory
    d) the fetch of thread.state can start before the store of thread.state
    e) MSI-X interrupt message is already in R0.

    The issue with spinlocks in the Linux kernel is that the kernel a lot of
    the time has interrupts enabled. And the kernel has spinlocks it needs to
    win. But if the kernel thread gets interrupted with a higher priority task once it gets the spinlock, but before it finishes with it, this is bad since some other thread now has to wait a long time for the spinlock to free up.
    It also might cause deadlocks.

    So the Linux macro is spinlock_irqsave(), which first masks interrupts,
    then wins the spinlock (spinning until it wins). Then do the operation
    needed, then do spinunlock_irqrestore() to release the spinlock and
    allow interrupts again.

    The basic idea is this should be fast. We DO NOT want a higher priority interrupt to occur while we own the spinlock. So we just need to
    temporarily mask interrupts. And I'm suggesting this masking and unmasking
    of interrupts should be very lightweight and fast.

    I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe they've moved all the important stuff to not use spinlock_irqsave() anymore,
    I don't know. It still shows up a lot in the Linux source though in various drivers at least.

    So, by the time control arrives, ISR has a complete thread.state, including >the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >near-40-bits from the interrupt so it can figure out what to do based on >control arriving here. ISR can use any instruction including ESM. ISR
    can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >instruction). When ISR performs SVR, SVR check pending threads in interrupt >table, and transfers control to any pending thread higher than interrupted >thread (SW does not have to check).

    Interrupts do not need to be masked, as all ISRs of lesser priority have >already been blocked, and the thread is already in a re-entrant state--
    so, no quirky instruction sequences are necessary. ISRs can be compiled
    with no compiler-knowledge of it being an ISR (and no ASM instructions).

    No, during the kernel spinlock_irqsave(), you have to mask out all
    interrupts, including higher-priority ones.

    Kent
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Sun Jun 14 21:46:19 2026
    From Newsgroup: comp.arch

    [email protected] (Kent Dickey) writes:
    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:
    [snip]

    I'm just going to follow up on Linux kernel spinlocks.

    For Linux spinlocks, if my memory of this is still right, the Linux
    kernel code often needs to mask interrupts,

    Or transition to a priority level that no interrupt can raise.

    The issue with spinlocks in the Linux kernel is that the kernel a lot of
    the time has interrupts enabled. And the kernel has spinlocks it needs to >win. But if the kernel thread gets interrupted with a higher priority task >once it gets the spinlock, but before it finishes with it, this is bad since >some other thread now has to wait a long time for the spinlock to free up.
    It also might cause deadlocks.

    So the Linux macro is spinlock_irqsave(), which first masks interrupts,
    then wins the spinlock (spinning until it wins). Then do the operation >needed, then do spinunlock_irqrestore() to release the spinlock and
    allow interrupts again.

    The basic idea is this should be fast. We DO NOT want a higher priority >interrupt to occur while we own the spinlock. So we just need to
    temporarily mask interrupts. And I'm suggesting this masking and unmasking >of interrupts should be very lightweight and fast.

    I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >they've moved all the important stuff to not use spinlock_irqsave() anymore, >I don't know. It still shows up a lot in the Linux source though in various >drivers at least.

    So, by the time control arrives, ISR has a complete thread.state, including >>the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >>near-40-bits from the interrupt so it can figure out what to do based on >>control arriving here. ISR can use any instruction including ESM. ISR
    can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >>instruction). When ISR performs SVR, SVR check pending threads in interrupt >>table, and transfers control to any pending thread higher than interrupted >>thread (SW does not have to check).

    Interrupts do not need to be masked, as all ISRs of lesser priority have >>already been blocked, and the thread is already in a re-entrant state--
    so, no quirky instruction sequences are necessary. ISRs can be compiled >>with no compiler-knowledge of it being an ISR (and no ASM instructions).

    No, during the kernel spinlock_irqsave(), you have to mask out all >interrupts, including higher-priority ones.

    While true, spinlock_irqsave() will mask interrupts on the current core/hardware thread; Other cores/threads will continue running and
    taking interrupts that are either 1-of-N interrupts or specifically
    assigned to a core.

    In any operating system where nested interrupts on a core are supported, the interrupt handling code must be very carefully coded to ensure that
    the handler of a lower priority interrupt does not obtain resources
    (e.g. locks) that the higher priority interrupt will require to complete processing the interrupt (classic deadlock). There are other issues
    related to kernel thread stack sizes that also pop up with nested
    interrupts.

    Linux doesn't support nested interrupts on a core, rather it
    defers the interrupt hanndling work to a kernel thread and returns
    immediately from the interrupt, unless handling the interrupt is trivial.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Sun Jun 14 20:18:13 2026
    From Newsgroup: comp.arch

    [email protected] (Kent Dickey) writes:
    The issue with spinlocks in the Linux kernel is that the kernel a lot of
    the time has interrupts enabled.

    Most of the spinlocks I've lived with were in a system with a notion
    of a prioritized hierarchy of interrupt levels. Taking a lock was
    an atomic operation; when your call returned, the lock was held _and_
    no interrupt at or below the specified level could happen. There was
    never an assumption that you could spin on a lock because some other
    CPU would eventually release it; the SMP kernel could run with a single processor.

    This was all in a BSD-ish world; SVR3 STREAMS added an entire parallel ecosystem of SMP issues.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Jun 15 17:55:57 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/11/2026 7:20 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/9/2026 6:33 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/7/2026 11:34 AM, MitchAlsup wrote:
    -------------------------

    Notice the weak vs strong CAS in C++. Weak can fail even though the
    location is equal to the comparand. It should have succeeded! strong >>>> does not allow that to occur. So, it can be used for interesting state >>>> machines and such.

    [...]

    The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.

    If you want the strong CAS it is written::

    LDD Rd,[address]
    ...
    ...
    LL R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    If you want weak CAS it is written:

    LL Rd,[address]
    ...
    ...
    LD R~d[address]
    CMP Rt,Rd,R~d
    PNE Rt,T
    SC Rs,[address]

    It just changes the length of time interference is being observed.

    Pretty interesting! So, we can have spurious failure on the weak CAS and >> none on the strong CAS?

    It is not zero, but it is exponentially fewer; and no architecture is
    going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

    Just to clarify, strong CAS should never spuriously fail. Right?

    Yes.



    strong CAS failure can be used in a state machine, weak CAS failure
    cannot...

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 16 12:19:07 2026
    From Newsgroup: comp.arch

    On 6/14/2026 2:46 PM, Scott Lurndal wrote:
    [email protected] (Kent Dickey) writes:
    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    [email protected] (Kent Dickey) posted:
    [snip]

    I'm just going to follow up on Linux kernel spinlocks.

    For Linux spinlocks, if my memory of this is still right, the Linux
    kernel code often needs to mask interrupts,

    Or transition to a priority level that no interrupt can raise.

    The issue with spinlocks in the Linux kernel is that the kernel a lot of
    the time has interrupts enabled. And the kernel has spinlocks it needs to >> win. But if the kernel thread gets interrupted with a higher priority task >> once it gets the spinlock, but before it finishes with it, this is bad since >> some other thread now has to wait a long time for the spinlock to free up. >> It also might cause deadlocks.

    So the Linux macro is spinlock_irqsave(), which first masks interrupts,
    then wins the spinlock (spinning until it wins). Then do the operation
    needed, then do spinunlock_irqrestore() to release the spinlock and
    allow interrupts again.

    The basic idea is this should be fast. We DO NOT want a higher priority
    interrupt to occur while we own the spinlock. So we just need to
    temporarily mask interrupts. And I'm suggesting this masking and unmasking >> of interrupts should be very lightweight and fast.

    I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >> they've moved all the important stuff to not use spinlock_irqsave() anymore, >> I don't know. It still shows up a lot in the Linux source though in various >> drivers at least.

    So, by the time control arrives, ISR has a complete thread.state, including >>> the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and
    near-40-bits from the interrupt so it can figure out what to do based on >>> control arriving here. ISR can use any instruction including ESM. ISR
    can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single
    instruction). When ISR performs SVR, SVR check pending threads in interrupt >>> table, and transfers control to any pending thread higher than interrupted >>> thread (SW does not have to check).

    Interrupts do not need to be masked, as all ISRs of lesser priority have >>> already been blocked, and the thread is already in a re-entrant state--
    so, no quirky instruction sequences are necessary. ISRs can be compiled
    with no compiler-knowledge of it being an ISR (and no ASM instructions).

    No, during the kernel spinlock_irqsave(), you have to mask out all
    interrupts, including higher-priority ones.

    While true, spinlock_irqsave() will mask interrupts on the current core/hardware thread; Other cores/threads will continue running and
    taking interrupts that are either 1-of-N interrupts or specifically
    assigned to a core.

    In any operating system where nested interrupts on a core are supported, the interrupt handling code must be very carefully coded to ensure that
    the handler of a lower priority interrupt does not obtain resources
    (e.g. locks) that the higher priority interrupt will require to complete processing the interrupt (classic deadlock). There are other issues
    related to kernel thread stack sizes that also pop up with nested
    interrupts.

    Linux doesn't support nested interrupts on a core, rather it
    defers the interrupt hanndling work to a kernel thread and returns immediately from the interrupt, unless handling the interrupt is trivial.


    Lock/wait free in the interrupt handlers are usually fine. Makes things
    oh so much better. Actually, iirc is sem_post async safe? I think it is.
    --- Synchronet 3.22a-Linux NewsLink 1.2