• Re: ARM CAS vs LL/SC

    From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:24:12 2026
    From Newsgroup: comp.arch

    On 5/13/26 9:48 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 5/11/26 10:38 AM, Scott Lurndal wrote:

    [snip]
    For HW to 'recognize' a LL-OP-SC as an idiom, it would have to be
    'local'. 'Local' probably means that none-to-few instructions that
    are not part of the LL-OP-SC sequence appear within it (with a
    strong preference for none).

    That is: HW can recognize:

    LDDL R9,[IP,#48]
    ADD R10,R9,#1
    STDL R10,[IP,#48] // same address "pattern"

    as an idiom fairly easily, but not:

    MOV R8,#48
    MOV R6,#48
    ...
    LDDL R9,[IP,R6]
    ADD R10,R9,#1
    STDL R10,[IP,R8] // same address "pattern" ???

    Said recognized idiom can be packaged up and shipped out through
    the memory hierarchy as an XADD without much trouble.
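    The locality argument above can be illustrated with a toy
    recognizer. This is only a sketch under stated assumptions: the
    Insn record and fuse_ll_add_sc function are hypothetical names
    invented for illustration, LL/SC stand in for LDDL/STDL, and the
    address check is a purely textual comparison of the operand
    strings, which is roughly what a cheap decode-time matcher could
    afford.

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Hypothetical decoded instruction, for illustration only."""
    op: str        # "LL", "ADD", "SC", or anything else
    dst: str       # destination register (for SC: the register being stored)
    src: str       # source register, or the address expression for LL/SC
    imm: int = 0   # immediate operand (used by ADD)


def fuse_ll_add_sc(window: List[Insn]) -> Optional[Tuple[str, str, int]]:
    """Return ("ATOMIC_ADD", address, imm) if the window is exactly
    LL; ADD; SC on one syntactically identical address, else None."""
    if len(window) != 3:
        return None
    ll, add, sc = window
    if (ll.op, add.op, sc.op) != ("LL", "ADD", "SC"):
        return None
    if add.src != ll.dst:   # the ADD must consume the loaded value
        return None
    if sc.dst != add.dst:   # the SC must store the ADD result
        return None
    if sc.src != ll.src:    # textual address match only; no value tracking
        return None
    return ("ATOMIC_ADD", ll.src, add.imm)
```

    With this matcher the [IP,#48] sequence fuses, while the
    [IP,R6]/[IP,R8] sequence does not, even though both registers
    hold 48, because the matcher never looks at register values.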

    In the context of a compiler emitting a specialized atomic
    instruction or a (idiomatically similar) LL/SC sequence, this is
    not an issue. If the compiler can emit "AtADD R10 ← [R9], #1",
    it could emit "LL R10 ← [R9]; ADD R10 ← R10, #1; SC [R9] ← R10;"
    and hardware could convert that to behave as an "AtADD".

    For conventional LL/SC, the example would also fail if R6 ≠ R8,
    and I do not see a reason the compiler would generate such code.
    (Maybe using the "free" check might be justified in some unusual
    case?)

    For a more flexible LL/SC interface — even one that merely
    allowed the SC to target any location in the same cache block
    reservation as the LL — such code might be reasonable and not
    trivially recognized as performing a simple operation on a
    single address (i.e., exportable to simple core-external
    hardware). That would just be a missed optimization opportunity.

    Please elaborate. There are few restrictions on the instructions
    that lie between the LL and SC instructions - I don't see how
    any CPU could translate an arbitrary sequence of instructions
    between the LL and SC into an atomic bus operation efficiently.

    The hardware implementation can choose which LL/SC-guarded
    operations to export,

    No, no, no; HW exports LL and SC and SW uses them as it sees fit.
    The only choice HW has is whether LL and SC are user instructions.

    I do not understand. Hardware can choose not to export "LL R10 ←
    [R9]; SINE R10 ← R10; SC [R9] ← R10;" and perform such as a
    "normal" LL/SC operation. (In-cache hardware is unlikely to
    support transcendental FP operations and nor is PCI likely to
    define support for such soon.)

    which to optimize into a fast path within
    the processor, and which to treat conventionally. Even in a more
    conventional implementation, NAKs or deferred responses might be
    used to promote forward progress.

    This does require software developers to monitor what
    optimizations are implemented, at least if there are
    alternatives with possibly more desired performance
    characteristics.

    Unworkable.

    It would be unworkable if such monitoring was necessary for most
    software. For peak performance, microarchitectural knowledge is
    sometimes needed (cache characteristics, operation latency,
    etc.). Making such specialization rarely measurably beneficial
    and very rarely substantially beneficial seems important.

    I seem to recall you stating that a 2% local performance penalty
    was acceptable.

    For LL/SC fast pathing, there may not be many cases where the
    semantics can be expressed in a diverse enough manner that
    faster expressions would be possible.

    [snip]
    Even with atomic instructions, I get the impression that the
    explicit implementation (performance/scaling) is not
    architecturally defined.

    Scaling is not an Architectural property; it is an implementation
    property.

    Yet scaling factors could determine which algorithm is higher
    performance. E.g., if an atomic increment is not coalesced into
    a tree (i.e., mediocre scaling), an algorithm that uses fewer
    such operations but has other overheads might be chosen if/when
    better scaling is desired.
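    One way to picture "fewer such operations but other overheads"
    is local batching. The sketch below is an assumption-laden
    model, not anything from a real library: SharedCounter and
    BatchedCounter are hypothetical names, a threading.Lock stands
    in for a hardware atomic add, and the atomic_ops field exists
    only to count how often the shared location is touched.

```python
import threading


class SharedCounter:
    """Shared location; the lock stands in for a hardware atomic add."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0
        self.atomic_ops = 0   # how many times the shared location was hit

    def atomic_add(self, n):
        with self._lock:
            self.value += n
            self.atomic_ops += 1


class BatchedCounter:
    """Per-thread accumulator that flushes to the shared counter
    once every `batch` increments."""

    def __init__(self, shared, batch):
        self.shared = shared
        self.batch = batch
        self.pending = 0

    def increment(self):
        self.pending += 1
        if self.pending >= self.batch:
            self.flush()

    def flush(self):
        if self.pending:
            self.shared.atomic_add(self.pending)
            self.pending = 0
```

    The "other overhead" in this sketch is staleness: the shared
    value lags by up to batch - 1 increments per thread until
    flush() is called. Whether that trade is worthwhile depends on
    how well the plain atomic add scales on the target, which is the
    point being made above.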

    This might be assigned to a more specific guarantee than the
    Architecture (which is classically defined as
    timing-independent), but that contract might be more general
    than an implementation, whether a "family" of similar
    implementations or a "profile" of non-Architectural behavior.

    An atomic instruction might be
    implemented with LL/SC with a guarantee of eventual success
    (which would hopefully not be as bad as some x86 global lock for
    cache block crossing LOCKed instructions).

    You might be surprised at how glacial that eventual success is.

    Chips and Cheese explored this recently. It is ugly.

    (AArch64's STADD does not guarantee that the addition will be
    done in the cache hierarchy even on a cache miss. The
    architecture merely guarantees that the operation will be
    atomic. An implementation could optimistically use an LL/SC-
    based mechanism and fall back to locking rather than just
    monitoring the reservation to ensure forward progress. With
    out-of-order execution, the actual store to shared memory has
    to be delayed until it is no longer speculative anyway,
    replaying an atomic operation can be faster than a branch
    misprediction — and even a branch misprediction can be fast
    compared to communication between caches.)
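    The "optimistic first, lock as fallback" shape described above
    can be modeled sequentially. This is a sketch only: Word,
    atomic_add, the interfere callback, and the max_optimistic
    limit are all hypothetical names chosen for illustration, the
    version field stands in for a reservation monitor, and the code
    is a single-threaded model rather than thread-safe code.

```python
import threading


class Word:
    """One memory word plus a write counter that models the monitor."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0                  # bumped on every write
        self.fallback = threading.Lock()  # hypothetical locked slow path


def atomic_add(word, n, interfere=lambda w: None, max_optimistic=3):
    """Add n to word. Returns (path, attempts), where path is
    "optimistic" or "locked"."""
    for attempt in range(1, max_optimistic + 1):
        seen_value, seen_version = word.value, word.version  # "LL"
        interfere(word)             # another agent may write at this point
        if word.version == seen_version:                     # "SC" succeeds
            word.value = seen_value + n
            word.version = seen_version + 1
            return ("optimistic", attempt)
    # Optimistic attempts exhausted: take the slow path, which in this
    # model excludes interference and so always completes.
    with word.fallback:
        word.value += n
        word.version += 1
    return ("locked", max_optimistic)
```

    In this model the fallback bounds the number of retries, which
    is the forward-progress property; it says nothing about how
    slow the locked path is on real hardware.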

    IMO, LL/SC is an obsolete artifact of the past.

    You, I, and Chris seem to agree on this detail.

    Not really. You view LL/SC as too limited a form of optimistic
    concurrency and not worth providing the implementation option of
    smaller reservations or fewer features than ESM provides. To me,
    My 66000's LOCKED memory instructions are basically the same as
    LL/SC "merely" extended to support six cache lines within an
    atomic scope and providing some other nice performance and
    usability enhancements. (My 66000 is not targeted at the market
    for 16-bit microcontrollers. The extra hardware for ESM is
    small, especially in the context of how useful it can be.)

    Scott Lurndal and Chris M. Thomasson at minimum see a place for
    single-instruction atomics (and seemingly not primarily to
    improve code density or decode complexity), which I believe were
    strongly rejected for My 66000 because of the need to add more
    instructions as capabilities expanded (like with SIMD).

    Eliminating optimistic atomic operations provided by an
    LL/SC-like mechanism ("LL/SC is an obsolete artifact of the
    past.") is actually contrary to My 66000's design philosophy.

    (Maybe this is just my weird conception of transactional memory
    as a general interface that can have its scope constrained to a
    single "word" granule and still be considered transactional memory.)

    There are certainly advantages to presenting a fully developed
    interface that supports a broad range of uses rather than
    incrementally extending an interface. It may well be wiser to
    provide something like ESM from the start rather than starting
    with classic LL/SC or even cache-block granular LL/SC (with
    multiple loads and stores and the SC able to use a different
    address than the LL) with published plans for extending the
    interface.

    I think ESM could be significantly extended (without adding
    instructions). Any page-aligned copy could be contained in an
    atomic operation by using a cache block monitor as a page
    monitor (presumably with a bitmask to indicate which blocks have
    been copied) — probably too specialized a use case to be worth
    the development and testing costs but possible. Increasing the
    number of cache blocks monitored would not require any new
    instructions. Supporting a read set constrained by L1 cache
    capacity or a conservative filter might not require any new
    instructions (though you have stated experience with initial ESM
    is needed to judge what the next step should be).

    I do wonder if something more like lock elision could be useful
    for increasing concurrency by reducing the number of names used
    to track conflict (lock name versus cache block address).

    I think there is potential with something like versioned memory
    to support more concurrency. A "stale" value would still be
    valid if the entire use of that value can be viewed as occurring
    earlier. (In theory, an ESM operation need not be aborted if a
    single read set cache block is written by another operation. The
    practical problem seems to be that tracking the dependencies for
    even a moderate number of atomic operations is complex. The
    benefit for interleaving atomic operations may very well not be
    worth so much complexity!)




    I disagree. I _feel_ LL/SC is a nice abstract interface that
    not only allows high-performance implementations of simple
    atomics without requiring new software but can also (in theory)
    be extended to multiple reservations (like My 66000's ESM) and
    even to very general transactional memory. (I think a better
    interface is possible with easier decode, better code density,
    and the opportunity for hints and/or directives, but such would
    introduce other costs.)

    I see specific atomic operations as somewhat attractive (idiom
    recognition is nice but it is not free), but potentially
    susceptible to an excessive expansion of instructions. (SIMD
    has similar tradeoffs. I like SIMD, but it has issues.)

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:33:33 2026
    From Newsgroup: comp.arch

    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.
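    The effect of granule size can be shown with a toy model. The
    sketch below assumes a single reservation per CPU and treats
    any store by another CPU into the reserved granule as
    interference; ReservationModel and its method names are
    hypothetical, and real coherence protocols are far more
    involved.

```python
class ReservationModel:
    """Toy LL/SC memory: one reservation per CPU, tracked per granule."""

    def __init__(self, granule_bytes=64):
        self.granule_bytes = granule_bytes
        self.mem = {}        # address -> value
        self.reserved = {}   # cpu id -> reserved granule index

    def _granule(self, addr):
        return addr // self.granule_bytes

    def ll(self, cpu, addr):
        """Load and set this CPU's reservation on the containing granule."""
        self.reserved[cpu] = self._granule(addr)
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Ordinary store: clears other CPUs' reservations on the granule."""
        g = self._granule(addr)
        victims = [c for c, rg in self.reserved.items() if rg == g and c != cpu]
        for c in victims:
            del self.reserved[c]
        self.mem[addr] = value

    def sc(self, cpu, addr, value):
        """Store only if this CPU still holds a reservation on the granule."""
        if self.reserved.pop(cpu, None) != self._granule(addr):
            return False
        self.store(cpu, addr, value)
        return True
```

    With a 64-byte granule, a store to address 8 by another CPU
    kills a reservation taken at address 0, while a store to
    address 64 does not. Making granule_bytes very large models a
    single global granule: still correct, but every store by any
    other CPU becomes interference.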

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for
    small LL/SC code bodies, single words for conflicts with other
    atomic operations — normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead). Normal loads and stores within the code body would
    be "guarded" and the SC could have a different address than the
    LL. I.e., forward compatibility would be possible without adding
    any Architectural state or new instructions while providing new
    functionality.
  • From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:47:57 2026
    From Newsgroup: comp.arch

    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the bus
    lock and still make forward progress... Sigh... A horrible LL/SC
    thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    Of course, the temptation toward "good enough" (not so bad that
    one will lose too many customers) is a problem. I would expect
    documented guarantees to be general enough that the cognitive
    load for software developers is acceptable. That such
    guarantees seem to be very rare is sad.
  • From Paul Clayton@[email protected] to comp.arch on Wed May 20 20:04:32 2026
    From Newsgroup: comp.arch

    On 5/14/26 11:03 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/11/26 10:38 AM, Scott Lurndal wrote:

    [snip]
    IME, atomic operations at the instruction set level have
    not been implemented with LL/SC, even on architectures
    that have LL/SC (or LDEX/STREX). The typical atomic
    operations are designed so that they can generate
    atomic PCIe (or other on-chip) transactions which cannot be
    simulated using LL/SC.

    I seem to recall reading Andy Glew mentioning that an x86
    implementation was using such an internal mechanism — and he
    expressed concerns about how it would ensure the Architectural
    guarantees.

    As I wrote before, any simple LL/SC operation that could be
    replaced by the compiler with a simple atomic instruction could
    be recognized by hardware as a special case for optimization and
    made to behave as if it was a single atomic instruction.

    [snip]
    We'll have to agree to disagree. I consider the lack of scalability
    of LL/SC to be a fatal defect.

    I believe the lack of scalability is an implementation choice
    and allowing that poor scalability is an Architectural choice.
    I.e., this is not about the instruction interface so much as
    about quality of implementation (and Architectural or "profile"
    guarantees).

    Maybe practically one cannot trust processor developers (and
    those defining the guarantees) to do the extra work to close
    that gap. Maybe advertising atomic instructions is more
    effective than advertising well-implemented LL/SC. (I am
    sufficiently discouraged about human nature and current human
    society to believe that "well-implemented LL/SC" is a
    cloud-cuckoo-land concept.)

    I wish that at least we could agree that simple LL/SC operations
    could _theoretically_ provide the same guarantees and
    optimization as simple atomic instructions.
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu May 21 20:17:14 2026
    From Newsgroup: comp.arch

    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).


    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for
    small LL/SC code bodies, single words for conflicts with other
    atomic operations — normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead).

    It would be limiting to tie LL/SC to cache lines.

    Atomics are independent of the cache, and can be used with
    both cacheable and non-cacheable memory as well as
    CXL and PCI Express devices.

  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu May 21 20:22:46 2026
    From Newsgroup: comp.arch

    Paul Clayton <[email protected]> writes:
    On 5/14/26 11:03 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/11/26 10:38 AM, Scott Lurndal wrote:

    [snip]
    IME, atomic operations at the instruction set level have
    not been implemented with LL/SC, even on architectures
    that have LL/SC (or LDEX/STREX). The typical atomic
    operations are designed so that they can generate
    atomic PCIe (or other on-chip) transactions which cannot be
    simulated using LL/SC.

    I seem to recall reading Andy Glew mentioning that an x86
    implementation was using such an internal mechanism — and he
    expressed concerns about how it would ensure the Architectural
    guarantees.

    As I wrote before, any simple LL/SC operation that could be
    replaced by the compiler with a simple atomic instruction could
    be recognized by hardware as a special case for optimization and
    made to behave as if it was a single atomic instruction.

    [snip]
    We'll have to agree to disagree. I consider the lack of scalability
    of LL/SC to be a fatal defect.

    I believe the lack of scalability is an implementation choice
    and allowing that poor scalability is an Architectural choice.
    I.e., this is not about the instruction interface so much as
    about quality of implementation (and Architectural or "profile"
    guarantees).

    Maybe practically one cannot trust processor developers (and
    those defining the guarantees) to do the extra work to close
    that gap. Maybe advertising atomic instructions is more
    effective than advertising well-implemented LL/SC. (I am
    sufficiently discouraged about human nature and current human
    society to believe that "well-implemented LL/SC" is a
    cloud-cuckoo-land concept.)

    I wish that at least we could agree that simple LL/SC operations
    could _theoretically_ provide the same guarantees and
    optimization as simple atomic instructions.

    Functionality guarantees, yes. Performance has to suffer,
    unless the hardware can analyze all the instructions between
    the LL/SC and abstract them into a single bus operation; which
    I don't see as feasible.

    If you can figure out how to implement LL/SC optimally
    to CXL remote memory for the same set of atomic operations
    provided by PCI express, I'd be interested in the result.
  • From MitchAlsup@[email protected] to comp.arch on Fri May 22 17:38:42 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for

    It took me an entire year (2000+ hours) to create ASF after knowing
    how LL/SC works. The "here is the basic idea" was only a couple of
    days--the rest of the time was making "here are a small number of
    cache lines", "make them all available at the same time", in such
    a way that "you can make all updates appear system wide in a single
    instance" or "make them appear to have never been modified" with
    semantics that work EVEN IF YOU DO NOT HAVE A CACHE in the CPU.

    Then there is multiple-LL memory order semantics,
    detection of interference,
    a system arbiter when interference is heavy,
    and what to do when interference prevents completion.

    LL/SC is easy, compared to making multiple-LL and multiple-SC
    work.
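    The "all updates appear at once, or appear never to have
    happened" requirement can be sketched as a toy optimistic
    transaction over several lines. Everything here is an
    assumption made for illustration: Line and Transaction are
    hypothetical names, the per-line version counter stands in for
    interference detection, the max_lines limit is an arbitrary
    placeholder, and none of the hard parts listed above (ordering
    among multiple LLs, arbitration under heavy interference,
    cacheless operation) are modeled.

```python
class Line:
    """One monitored line: a value plus a write counter."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0


class Transaction:
    """Toy multi-line transaction: all writes commit together or none do."""

    def __init__(self, max_lines=4):
        self.max_lines = max_lines   # placeholder capacity limit
        self.seen = {}               # Line -> version at first access
        self.writes = {}             # Line -> pending value

    def _track(self, line):
        if line not in self.seen:
            if len(self.seen) >= self.max_lines:
                raise RuntimeError("transaction exceeds monitored-line capacity")
            self.seen[line] = line.version

    def load(self, line):
        if line in self.writes:      # read our own pending write
            return self.writes[line]
        self._track(line)
        return line.value

    def store(self, line, value):
        self._track(line)
        self.writes[line] = value    # buffered until commit

    def commit(self):
        if any(line.version != v for line, v in self.seen.items()):
            return False             # interference: drop every pending write
        for line, value in self.writes.items():
            line.value = value
            line.version += 1
        return True
```

    A failed commit leaves every line untouched, so software can
    simply retry or take a fallback path. The model checks for
    interference only at commit; it is a sequential illustration
    and provides no atomicity among real threads.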

    small LL/SC code bodies, single words for conflicts with other
    atomic operations — normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead). Normal loads and stores within the code body would
    be "guarded" and the SC could have a different address than the
    LL. I.e., forward compatibility would be possible without adding
    any Architectural state or new instructions while providing new
    functionality.
  • From MitchAlsup@[email protected] to comp.arch on Fri May 22 17:42:50 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 5/14/26 11:03 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/11/26 10:38 AM, Scott Lurndal wrote:

    [snip]
    IME, atomic operations at the instruction set level have
    not been implemented with LL/SC, even on architectures
    that have LL/SC (or LDEX/STREX). The typical atomic
    operations are designed so that they can generate
    atomic PCIe (or other on-chip) transactions which cannot be
    simulated using LL/SC.

    I seem to recall reading Andy Glew mentioning that an x86
    implementation was using such an internal mechanism — and he
    expressed concerns about how it would ensure the Architectural
    guarantees.

    As I wrote before, any simple LL/SC operation that could be
    replaced by the compiler with a simple atomic instruction could
    be recognized by hardware as a special case for optimization and
    made to behave as if it was a single atomic instruction.

    [snip]
    We'll have to agree to disagree. I consider the lack of scalability
    of LL/SC to be a fatal defect.

    I believe the lack of scalability is an implementation choice
    and allowing that poor scalability is an Architectural choice.
    I.e., this is not about the instruction interface so much as
    about quality of implementation (and Architectural or "profile"
    guarantees).

    Maybe practically one cannot trust processor developers (and
    those defining the guarantees) to do the extra work to close
    that gap. Maybe advertising atomic instructions is more
    effective than advertising well-implemented LL/SC. (I am
    sufficiently discouraged about human nature and current human
    society to believe that "well-implemented LL/SC" is a
    cloud-cuckoo-land concept.)

    I wish that at least we could agree that simple LL/SC operations
    could _theoretically_ provide the same guarantees and
    optimization as simple atomic instructions.

    You cannot make an LL/SC architecture that can do both Test-and-set
    and Compare-and-swap with commonly held semantics of T&S and CAS.
    One requires monitoring the LL address for interference from the
    LL to the SC, the other requires not knowing about interference
    and only checking of data-equivalence at SC.
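    The distinction drawn above (interference monitoring versus
    data equivalence) shows up in the classic A-B-A sequence. The
    following is a sequential toy model with hypothetical names
    (Cell, cas, ll, sc); the write counter stands in for a
    reservation monitor and is not how any particular hardware
    implements it.

```python
class Cell:
    """One memory location plus a count of stores made to it."""

    def __init__(self, value):
        self.value = value
        self.writes = 0   # stands in for an interference monitor

    def store(self, value):
        self.value = value
        self.writes += 1


def cas(cell, expected, new):
    """Compare-and-swap: succeeds whenever the data matches."""
    if cell.value != expected:
        return False
    cell.store(new)
    return True


def ll(cell):
    """Load-locked: returns the value and a reservation token."""
    return cell.value, cell.writes


def sc(cell, token, new):
    """Store-conditional: succeeds only if no store intervened."""
    if cell.writes != token:
        return False
    cell.store(new)
    return True
```

    If another agent writes "B" and then "A" again between the read
    and the update, cas() succeeds because the data is equal,
    while sc() fails because the location was written. The two
    primitives therefore answer different questions.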
  • From Paul Clayton@[email protected] to comp.arch on Sun May 24 17:24:47 2026
    From Newsgroup: comp.arch

    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    The less specifically the size is defined, the less performance-
    portable software becomes. One can address this with something
    like RISC-V profiles, in which sizes can be more specific and
    software that cares will specify a target profile rather than an
    Architecture (version).

    Since granule size can influence what code is most efficient,
    even recompiling is not an excellent option. So for a class of
    applications, having a single target seems to make sense.

    Being able to test software on a development machine can also be
    useful, so desired performance compatibility might be broader
    than an application type.

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for
    small LL/SC code bodies, single words for conflicts with other
    atomic operations — normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead).

    It would be limiting to tie LL/SC to cache lines.

    It is not tying the operation to cache lines but to cache
    line granules in terms of external interference monitoring
    (and, in the case of a modest extension beyond traditional
    LL/SC, the scope of the read/write set).

    Atomics are independent of the cache, and can be used with
    both cacheable and non-cacheable memory as well as
    CXL and PCI Express devices.

    I am not certain that LL/SC (or an extended form of such)
    could not be used with "I/O" addresses. This merely requires
    the equivalent of one cache line "cache" (or the largest
    guaranteed size of a transaction) and some form of
    monitoring ("coherence") of such memory addresses.

    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    For other operations, I am not certain what semantics make
    sense. If a read at one address changes the behavior of another
    access, does "atomic" behavior mean that the later in program
    order access happens before the I/O agent changes the access
    behavior or does it mean that the atomic action blocks "ordinary
    software agents" but lets side effects caused by the action
    occur in program order? The former seems more orthogonal — all
    agents are treated the same — but the latter seems more
    consistent with the rule that actions should occur as if no other
    threads were running. If the I/O agent is considered just
    another agent, then I/O addresses with side effects within the
    granule might reasonably be considered interference causing the
    transaction to always fail.

    I do not know how an atomic operation instruction would handle
    a perverse case. If such instructions generate an exception, a
    LL/SC sequence could do the same or produce an "always fail"
    transaction failure indicator (probably with additional
    metadata to indicate the nature of the failure).

    I do not know what the monitoring implications for supporting
    I/O atomics would be. For simple operations, translation to the
    equivalent of an atomic instruction seems reasonable. However,
    if more extensive operations are permitted, then considerable
    care seems necessary to define semantics that are
    comprehensible, testable, and cost-effective.

    My perception is that PCI-E atomics are not meant for
    non-idempotent storage. (I do not know how ARM atomic
    instructions handle such cases. [I am being lazy and not waiting
    to look up this information and edit this before posting.])
  • From Paul Clayton@[email protected] to comp.arch on Sun May 24 21:32:47 2026
    From Newsgroup: comp.arch

    On 5/22/26 1:38 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for

    It took me an entire year (2000+ hours) to create ASF after knowing
    how LL/SC works. The "here is the basic idea" was only a couple of
    days--

    This is one of the benefits of others' recording their
    experiments. The "basic idea" becomes a trivial extension of
    prior work or at least "obvious to one skilled in the art".

    (Because I only think about hardware at a fairly high level of
    abstraction, I can hand wave a lot of issues. I am also almost
    completely ignorant of "system-level" issues and other complex
    interactions. I do not mean to belittle the effort behind the
    original ASF — though from some of your statements dealing with
    "business issues" was a significant part of the effort and kept
    AMD's ASF from having some of the features you developed.)

    By the way, is there a reason that ESM did not include the
    same operation provided by RELEASE in AMD's ASF? Is removing
    entries from the transaction not worthwhile for the sort of
    smallish transactions targeted by ESM? (It is also possible that
    I missed the presence of such in ESM or that my version of
    Principles of Operations (28 January 2020) is so out-of-date
    that it is not accurate for ESM anymore.)

    Discarding read set members seems tricky for software as it
    would have to guarantee that no "overlapping" reads occurred.
    Such is possible if multiple data structures do not share a
    cache block (or more complexly if any possible cache block
    sharers are never involved together in a transaction that
    discards possible sharers).

    The ASF justification for RELEASE — "RELEASE can be used to
    circumvent ASF's capacity limitations when traversing
    potentially long chains of pointers." — is a limited use case
    *and* being a "hint" it did not increase the guaranteed capacity
    (four 64-byte memory regions) so a transaction would still
    require fallback code.

    ASF also supported "unprotected" (transaction escaping) memory
    accesses (those that do not use the LOCK prefix), which I think
    ESM does not provide. Such could be useful for thread-local and
    "ROM" accesses (to avoid capacity issues) and for "shared"
    unconditional accesses (which might be guarded by a rarely
    contested lock, have software handling for inconsistent state,
    or be hint-like such as software performance counters). I guess
    such could also be useful for any data that is "unreachable"
    until the transaction commits such as a new memory allocation,
    but this seems the same as thread-local memory or a lock-guarded
    memory set.

    the rest of the time was making "here are a small number of
    cache lines", "make them all available at the same time", in such
    a way that "you can make all updates appear system wide in a single
    instance" or "make them appear to have never been modified" with
    semantics that work EVEN IF YOU DO NOT HAVE A CACHE in the CPU.

    The practical implementation aspects naturally take longer.

    I would have guessed that the NAK trick and the interference
    counter trick did not come to mind in the first moment. The NAK
    trick may have been more obvious in concept ("almost done, just
    give me a sec") but working it out so that it would not cause
    performance issues or even (practically) lock-ups is harder.
    (The replay problem for an out-of-order scheduler seems simple
    enough in concept but is a hard problem in an actual high-
    performance design.)

    The usability tuning (and extension directions) tends to require
    actual hardware (simulation may be too expensive and limited
    mostly to internal exploration — having users attempt crazy
    things can be helpful). Some extension possibilities are
    obvious (and some obviously practical at least in terms of
    hardware cost), but even some of the tuning of the performance
    of the existing interface might require years of experience with
    use of the hardware by a reasonably broad range of users.

    (I do not think not having a proper cache is so much of a
    problem. Yes, the coherence interface would have to be added to
    detect interference and enough buffering to supply the required
    capacity. On the other hand, I would be tempted to use that
    storage for other things, maybe prefetch buffers. For non-
    scalable systems, it might be practical to share buffers among
    multiple agents, e.g., allowing 16 cache blocks among 8 cores
    with no cache, which could also move coherence to that central
    storage, but that seems to imply writing directly to such
    "distant" storage.)

    I think another factor is having the atomic operation be
    inexpensive both in the case of no interference and in the retry
    case. I got the impression that Intel's TSX (and lock elision)
    were expensive both in set-up and especially in retry; part of
    this may be from competing with slow LOCK-based atomic
    instructions.

    In theory, a failure should not be much more expensive than a
    branch misprediction. (An "advanced" implementation could
    provide faster retries for small transactions by keeping the
    decoded instructions in schedulers and requesting updates from
    writes by external agents [i.e., as soon as the external write
    committed or the invalidation response was received if the write
    committed before that, an implicit read request would be acted
    upon].) Even with L1 cache capacity transactions, clearing all
    the transaction bits can be fast and if most of the transaction
    was read set a retry should run fairly fast.

    (Write set cache blocks would normally be written back to L2 to
    provide a checkpoint, so retrying would involve an L1 miss for
    all write set blocks. This could be avoided for small write sets
    that fit in load-store queue entries or some other buffer.
    Keeping a list of written cache blocks would also allow
    prefetching (and retention in L2 so checkpointing would not
    require writebacks).)

    Then there is multiple-LL memory order semantics,
    detection of interference,
    a system arbiter when interference is heavy,
    and what to do when interference prevents completion.

    LL/SC is easy, compared to making multiple-LL and multiple-SC
    work.

    Which is one reason that I thought a cache line granular LL/SC
    might be a reasonable next step beyond traditional LL/SC.

    Single address atomic operations (with larger granule) are
    also attractive in facilitating commit order optimization.
    (Many single "word" atomic operations facilitate more
    optimizations like remote execution and coalescing for
    operations like atomic add.)

    Of course, once someone has worked out how to do multiple
    cache block reservations well, limiting an implementation to
    one cache block might not be reasonable.

    One issue with a LL/SC-oriented interface is that a fully
    external operation is harder to express. Zeroing all the
    touched registers immediately after the operation technically
    would communicate that the loaded values and their descendants
    were not preserved outside the transaction, but that would be
    a messy idiom to detect. With a read-as-zero-or-last-write
    register, single value "push-only" operations might be easier
    to detect; though detecting this for LL R2 ← [R1];
    OP R2 ← R3, R4; SC [R1] ← R2; might not be that much more
    difficult than for LL R0 ← [R1]; OP R0 ← R3, R4; SC [R1] ← R0;
    (special casing one register that is already special cased
    may be a little easier).


    Side comment: ESM and extended LL/SC mechanisms seem to have a
    problem with expressing "exportable" operations even at the
    close of a transaction. Operations that could be performed
    remotely or even coalesced could avoid "false" interference if
    hardware knew that the operation did not use the loaded value
    except to compute a new value and did not use the new value
    within the transaction except to store it to the original
    address (though copy and perhaps swap operations might also be
    practical targets for exporting).

    E.g., a bank transfer might only need to read the current
    balance from the provider account (to ensure there are
    sufficient funds) and subtract the amount transferred and then
    use an exportable atomic add for the other account. (Software
    might guarantee no overflow fairly easily with 64-bit integers
    (and no transfers larger than the U.S. debt ☹) or hardware might
    provide an exception and software could correct the problem.)

    The exported operation still needs to be conditional on the
    transaction (otherwise it could just be a separate transaction,
    though that might be very expensive in some implementations),
    but it does not have the same kind of data dependency that other
    atomic operations have.
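
    A toy model of the bank transfer may make the point concrete. This
    is only a sketch under stated assumptions: ToyMemory, its
    per-address version counters, and the single commit lock are
    invented for illustration and do not describe any real hardware.
    The debit guards the provider account in its read set, while the
    credit is an "exported" add that is applied only if the transaction
    commits and that never places the destination in the read set.

```python
import threading


class ToyMemory:
    """Toy model: values plus per-address version counters (assumption)."""

    def __init__(self, initial):
        self._lock = threading.Lock()  # stands in for a commit arbiter
        self.values = dict(initial)
        self.versions = {addr: 0 for addr in initial}

    def read(self, addr):
        """Return (value, version) so the caller can record its read set."""
        with self._lock:
            return self.values[addr], self.versions[addr]

    def store(self, addr, value):
        """An ordinary store by some other agent."""
        with self._lock:
            self.values[addr] = value
            self.versions[addr] += 1

    def commit(self, read_versions, writes, exported_adds):
        """Apply writes and exported adds only if the read set is untouched."""
        with self._lock:
            for addr, seen in read_versions.items():
                if self.versions[addr] != seen:
                    return False  # interference on a guarded address
            for addr, value in writes.items():
                self.values[addr] = value
                self.versions[addr] += 1
            for addr, delta in exported_adds.items():
                # Blind add: the old value is never exposed to the transaction
                self.values[addr] += delta
                self.versions[addr] += 1
            return True


def transfer(mem, src, dst, amount):
    """Debit src (guarded read) and credit dst (exported add)."""
    balance, version = mem.read(src)
    if balance < amount:
        return False  # insufficient funds
    return mem.commit({src: version}, {src: balance - amount}, {dst: amount})
```

    In this model a concurrent store to the destination account between
    the read and the commit does not fail the transfer, while a store to
    the provider account does.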

    In theory, multiple operations could be exported if they are all
    dependent only on the other parts of the transaction committing.
    However, ensuring that the ordering guarantees are enforced
    seems likely to be very difficult with more than one simple
    atomic operation.

    I _think_ any single store (which could include a cache block
    with a write mask) could *theoretically* be exported if the
    block was not in the transaction's read set. This would still
    delay the commitment of the transaction until after the outer
    memory system confirmed that there were no conflicts (so the
    latency would be similar to cache miss handling) because the
    read set still needs to be guarded, but such might reduce
    conflicts either by using finer-grained interference detection
    or by facilitating optimization of ordering (using more
    central arbitration).

    I am rather skeptical that there are significant uses for such
    "blind stores" much less enough to justify such complexity. Yet
    if I am reasoning correctly, such could slightly improve
    performance of that corner case.

    Another side comment: in theory, a shared bump counter
    allocation could use an exported atomic add and not need the
    actual result until after the transaction if a temporary pseudo-
    address is assigned as a placeholder value for the address that
    will be generated by the atomic add (and if there was a
    guarantee that the allocation would succeed or dropping the data
    was acceptable on allocation failure — providing a "red zone"
    of memory for such "failed" allocations would be another
    option and when the red zone reaches a watermark all such
    allocation optimizations are not performed).
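
    A minimal sketch of the placeholder idea, as a single-threaded
    Python model. BumpHeap, Transaction, and the use of small integers
    as placeholders are all assumptions made for the example; the red
    zone and allocation-failure handling are not modeled.

```python
class BumpHeap:
    """Shared bump allocator; fetch_add models the exported atomic add."""

    def __init__(self, base):
        self.next_free = base
        self.memory = {}  # toy memory: address -> value

    def fetch_add(self, size):
        old = self.next_free
        self.next_free += size
        return old


class Transaction:
    """Buffers stores against placeholders until commit."""

    def __init__(self, heap):
        self.heap = heap
        self.alloc_sizes = []  # index in this list is the placeholder
        self.pending = []      # (placeholder, offset, value)

    def alloc(self, size):
        """Return a placeholder, not a real address."""
        self.alloc_sizes.append(size)
        return len(self.alloc_sizes) - 1

    def store(self, placeholder, offset, value):
        self.pending.append((placeholder, offset, value))

    def commit(self):
        """Only now perform the atomic adds and substitute real addresses."""
        bases = [self.heap.fetch_add(size) for size in self.alloc_sizes]
        for placeholder, offset, value in self.pending:
            self.heap.memory[bases[placeholder] + offset] = value
        return bases
```

    The shared counter is not touched until commit, so in this model
    the transaction body creates no interference on it.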

    (Memory allocation in general is separable in this manner. The
    actual address returned by the allocation is not needed until
    the allocation is visible outside of the thread; a placeholder
    address can provide coherence within a thread. In theory, a
    placeholder address could even be used between threads if
    part of the virtual address space is reserved for such uses, but
    replacing such uses seems like doing garbage collection for a
    C program. With capabilities or marked pointers, such data
    would be distinguishable as pointers and so might be garbage
    collected at some cost. [Hmm. Would there be any value in a
    page-granular protection that only prohibited writing to
    (marked) pointers?])

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun May 24 23:35:20 2026
    From Newsgroup: comp.arch

    On 5/21/26 4:22 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    [snip]
    I wish that at least we could agree that simple LL/SC operations
    could _theoretically_ provide the same guarantees and
    optimization as simple atomic instructions.

    Functionality guarantees, yes. Performance has to suffer,
    unless the hardware can analyze all the instructions between
    the LL/SC and abstract them into a single bus operation; which
    I don't see as feasible.

    If you can figure out how to implement LL/SC optimally
    to CXL remote memory for the same set of atomic operations
    provided by PCI express, I'd be interested in the result.

    I am not a hardware designer, but recognizing LL Rx ← [Ry];
    OP Rx ← Ra, Rb; SC [Ry] ← Rx and converting it to the
    appropriate PCIe atomic (when "OP" is a PCIe supported atomic
    operation) does not seem that difficult. Yes, three instruction
    idioms are more complex than two instruction idioms, but the
    first part of detection (destination of first instruction is the
    same as the source for the following instruction) is a common
    idiom detection factor and necessary even for in-order
    superscalar execution.

    CMP+Jn fusion in x86 is a little simpler since Jn will always be
    dependent on an immediately preceding CMP (there is only one
    flags register), but it still requires comparing two opcodes.

    For the proposed limited LL/SC fusion, I think the following
    logic suffices:

    if I[0].opcode == LL
    and
    if I[0].Rdst == (I[1].Rsrc1 or I[1].Rsrc2)
    and
    if I[0].Rdst == I[2].Rsrc1
    and
    if I[2].opcode == SC
    and
    if I[1].opcode == EXPORTABLE

    It would probably be acceptable to assume that if the third
    instruction is a SC, it will also meet the pattern (so a very
    unusual misspeculation — conditionally storing a value unrelated
    to the linked load — could be handled slowly) and check that the
    register is the same later (though that register check might not
    add latency since dependencies need to be checked anyway). It
    may also be acceptable to delay the operation check as the LL
    address generation is required regardless of how the
    operation is actually handled.
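
    As a rough illustration, the five conditions listed above can be
    modeled in a few lines of Python. This is only a sketch: the Insn
    record, its field names, and the EXPORTABLE set are assumptions
    made for the example, both source operands are taken from the
    middle instruction, and (like the list above) it does not check
    that the LL and SC use the same address register.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical set of operations that could be shipped out as one atomic.
EXPORTABLE = {"ADD", "AND", "OR", "XOR"}


@dataclass(frozen=True)
class Insn:
    opcode: str
    rdst: Optional[int] = None   # destination register, if any
    rsrc1: Optional[int] = None  # first source (for SC: the data register)
    rsrc2: Optional[int] = None  # second source, if any


def is_fusible(window: Sequence[Insn]) -> bool:
    """True when a three-instruction window matches LL; OP; SC."""
    if len(window) != 3:
        return False
    ll, op, sc = window
    return (
        ll.opcode == "LL"
        and ll.rdst is not None
        and ll.rdst in (op.rsrc1, op.rsrc2)  # OP consumes the loaded value
        and ll.rdst == sc.rsrc1              # SC stores from the LL register
        and sc.opcode == "SC"
        and op.opcode in EXPORTABLE
    )
```

    Real decode logic would compare bit fields in parallel rather than
    evaluate the tests in sequence; the model only shows how few
    comparisons are involved.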

    An alternative would be to fuse every three instruction LL/SC
    sequence and crack the fused instruction later if the operation
    is not one supported in a fused format. (This cracking could be
    independent of whether the operation can be exported. An
    implementation might have a scheduler that fires an operation
    multiple times in different "modes" such that it could execute
    this fused operation internally.)

    For three-instruction LL/SC sequences, there is also very little
    reason for the intermediate instruction not to use the LL result
    as an input value. So one could probably speculate that an
    operand is the same and replay from fetch on misspeculation.

    This would have the fusion be dependent only on one comparison
    of about two sets of about six bits (admittedly, separated by
    the length of two instructions). For a narrow decode
    implementation this seems inappropriate.

    Such detection would require some extra buffering (even a
    wide decode implementation would have to handle crossing decode
    chunks), but such seems a modest overhead.

    Delaying a potentially exportable atomic operation by a cycle or
    two would also seem not to be very problematic. Even in an in-
    order implementation, the atomic operation cannot be exported
    until after all previous operations are guaranteed not to
    produce exceptions and the operands are available.

    I think I would prefer a specialized form of LL that produced an
    implicit SC after one (or N) instructions both to assist such
    idiom recognition and to provide code density (no SC instruction
    and no success test instruction — in that way similar to IBM's
    limited transactions which are guaranteed to complete). Even
    with an N-instruction body LL, there would only be a comparison
    of one opcode to a constant and the count field to one to detect
    the single-instruction case with an additional opcode comparison
    to determine if it is exportable. This does introduce one
    "extra" opcode, but it avoids adding an opcode for every
    possible exportable atomic operation and facilitates old
    software using new hardware features. The code density benefit
    would be greatest for single instruction bodies (100% extra
    overhead relative to a specialized atomic instruction compared
    to 300% overhead — SC, branch-on-interference — of traditional
    LL/SC), but I am not certain if limiting the opcode to that is
    best.

    I also am inclined to provide an interface that allows avoiding
    an explicit interference test and branch. For simple
    transactions that hardware can reasonably guarantee will
    complete, automatic retry is practical. Having a special
    exception for "always fail" or "recommend no retry" would add
    overhead associated with a single handler managing multiple
    failure points (admittedly on a generally less critical path)
    with the only benefit being slightly shorter dynamic code by
    removing a branch instruction. Since the cases where completion
    could not be guaranteed would tend to be long, the cost of a
    branch instruction may not be significant.

    (I just noticed that My 66000 has "predicate on the condition of
    interference", which may allow escaping memory accesses.)
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue May 26 12:44:20 2026
    From Newsgroup: comp.arch

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.
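
    A toy model of a reservation granule shows the trade-off. The
    Reservation class and its method names are assumptions for
    illustration only; it models another agent's store clearing the
    reservation, and whether anything else (a load, an eviction, a
    context switch) can also clear it is implementation specific and
    not modeled here.

```python
class Reservation:
    """One LL/SC reservation tracked at a fixed granule size (in bytes)."""

    def __init__(self, granule=64):
        self.granule = granule
        self.tag = None  # granule number currently reserved, if any

    def _tag(self, addr):
        return addr // self.granule

    def load_linked(self, addr):
        self.tag = self._tag(addr)

    def snoop_store(self, addr):
        """Another agent's store: clears the reservation on a granule hit."""
        if self.tag == self._tag(addr):
            self.tag = None

    def store_conditional(self, addr):
        """Succeeds only if the reservation for this granule survived."""
        ok = self.tag is not None and self.tag == self._tag(addr)
        self.tag = None
        return ok
```

    With a 64-byte granule, a store to the neighbouring line leaves the
    SC intact, while a store to a different word in the same line
    (false sharing) makes it fail. With a granule covering the whole
    address space, any store at all makes it fail.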

    With a large granule, we then need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
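
    The general shape of such a hashed-lock fallback can be sketched as
    below. This is not the code from the linked post; the table size,
    the 64-byte shift, and the dictionary standing in for memory are
    assumptions made for the example. The emulation is only correct if
    every access to the word goes through the same lock table.

```python
import threading

N_LOCKS = 64  # hypothetical table size
_locks = [threading.Lock() for _ in range(N_LOCKS)]
memory = {}   # toy "memory": address -> word value


def _lock_for(addr):
    # Drop the low 6 bits so words in one assumed 64-byte line share a lock
    return _locks[(addr >> 6) % N_LOCKS]


def emulated_cas(addr, expected, new):
    """Compare-and-swap on memory[addr], serialized by the hashed lock."""
    with _lock_for(addr):
        if memory.get(addr, 0) != expected:
            return False
        memory[addr] = new
        return True
```

    Unrelated addresses that hash to the same slot contend for one
    lock, so the table size trades memory against false contention.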

  • From MitchAlsup@[email protected] to comp.arch on Tue May 26 20:58:52 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    With a large granule, we then need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    Does this "LL/SC and other core instructions synchronization means" not
    fall from "desirable" when one has a complete set of to-memory() atomic
    actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
    quadratic and cubic interconnect traffic in the system which are the
    real point of slow synchronization ??!!?? while being guaranteed to
    work without an interference and can be done for both cacheable and
    unCacheable memory accesses ??!!??

    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

  • From Chris M. Thomasson@[email protected] to comp.arch on Tue May 26 14:00:36 2026
    From Newsgroup: comp.arch

    On 5/26/2026 1:58 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 5/24/2026 2:24 PM, Paul Clayton wrote:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    With a large granule, we then need to worry about a single load, say via
    false sharing or something... Well, can that cause the SC to fail?

    Does this "LL/SC and other core instructions synchronization means" not
    fall from "desirable" when one has a complete set of to-memory() atomic
    actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
    quadratic and cubic interconnect traffic in the system which are the
    real point of slow synchronization ??!!?? while being guaranteed to
    work without an interference and can be done for both cacheable and
    unCacheable memory accesses ??!!??

    Take a look at some S/HTM... A single load can cause a retry, and lead to
    live lock?




    FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
    using a hashed lock where address of a target word is used to index into
    an array. Something akin to:

    https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


  • From scott@[email protected] (Scott Lurndal) to comp.arch on Wed May 27 14:25:19 2026
    From Newsgroup: comp.arch

    Paul Clayton <[email protected]> writes:
    On 5/21/26 4:17 PM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
    [snip]
    Wrt LL/SC, how large is the reservation granule? PPC has some
    insight...

    Usually the reservation granule is the cache block in order to
    exploit existing cache coherence mechanisms.

    ARM architectures allow (but don't encourage) a reservation
    granule that covers the entire address space (e.g. see the
    ARMv7 ARM).

    Any larger granule assures correctness but hinders performance.
    A global lock works but does not allow much parallelism.

    The less specifically the size is defined, the less performance-
    portable software becomes. One can address this with something
    like RISC-V profiles, in which sizes can be more specific and
    software that cares will specify a target profile rather than an
    Architecture (version).

    Since granule size can influence what code is most efficient,
    even recompiling is not an excellent option. So for a class of
    applications, having a single target seems to make sense.

    Being able to test software on a development machine can also be
    useful, so desired performance compatibility might be broader
    than an application type.

    I feel there is relatively little to prevent LL/SC semantics
    from being extended to support multiple cache blocks (or, for
    small LL/SC code bodies, single words for conflicts with other
    atomic operations — normal loads and stores might still use
    cache block granularity to limit complexity and/or network
    overhead).

    It would be limiting to tie LL/SC to cache lines.

    It is not tying the operation to cache lines but to cache
    line granules in terms of external interference monitoring
    (and, in the case of a modest extension beyond traditional
    LL/SC, the scope of the read/write set).

    Atomics are independent of the cache, and can be used with
    both cacheable and non-cacheable memory as well as
    CXL and PCI Express devices.

    I am not certain that LL/SC (or an extended form of such)
    could not be used with "I/O" addresses. This merely requires
    the equivalent of one cache line "cache" (or the largest
    guaranteed size of a transaction) and some form of
    monitoring ("coherence") of such memory addresses.

    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?


    For other operations, I am not certain what semantics make
    sense. If a read at one address changes the behavior of another
    access, does "atomic" behavior mean that the later in program
    order access happens before the I/O agent changes the access
    behavior or does it mean that the atomic action blocks "ordinary
    software agents" but lets side effects caused by the action to
    occur in program order?

    Atomics ensure that the access is atomic with respect to
    all other accessors - ensuring that the other accessors
    will not see inconsistent data.

    Atomics can be used as a basis (e.g. atomic test&set) to
    guard a critical section, but they're also useful for
    adjusting shared counters et alia.

    My perception is that PCI-E atomics are not meant for
    non-idempotent storage. (I do not know how ARM atomic
    instructions handle such cases.)

    See above.
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:08:17 2026
    From Newsgroup: comp.arch

    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the bus lock
    and still make forward progress... Sigh... A horrible LL/SC thing can
    live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.


    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is hyper
    important to help the software pad and align to remove any false sharing
    on said granule. No? But...

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o


    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.




  • From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:14:11 2026
    From Newsgroup: comp.arch

    On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
    [...]
    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.

    A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
    LL/SC cannot... ?
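
    The difference can be seen in a small single-threaded model with
    injected interference. Cell and the interfere hook are assumptions
    for illustration; the compare-exchange stands in for an LL/SC pair
    with no architectural forward-progress guarantee, and real
    implementations may bound the retries.

```python
class Cell:
    """Toy shared word (single-threaded model, not truly atomic)."""

    def __init__(self, value=0):
        self.value = value

    def fetch_add(self, delta):
        """Models a native atomic add: one step, nothing to retry."""
        old = self.value
        self.value = old + delta
        return old

    def compare_exchange(self, expected, new):
        if self.value != expected:
            return False
        self.value = new
        return True


def fetch_add_via_cas(cell, delta, interfere=None):
    """Emulated fetch-and-add; returns (old value, failed attempts)."""
    failures = 0
    while True:
        old = cell.value                # plays the role of LL
        if interfere is not None:
            interfere(cell)             # another agent may write here
        if cell.compare_exchange(old, old + delta):  # plays the role of SC
            return old, failures
        failures += 1
```

    The native add completes in a bounded number of steps regardless of
    what other agents do (wait-free), while the number of failed
    attempts in the emulated loop equals however many times another
    agent wins the race (lock-free at best).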


  • From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:24:36 2026
    From Newsgroup: comp.arch

    On 5/27/2026 2:14 PM, Chris M. Thomasson wrote:
    On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
    [...]
    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very
    few.

    A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
    LL/SC cannot... ?



    For x86, it's "easier" for sure... pad _and_ align on an L2 cache line,
    and you should be ideal... So do NOT straddle a cache line and execute a
    damn LOCK RMW on it. Bus lock for sure.
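
    A small helper for checking layouts, as a sketch. The 64-byte line
    size is an assumption; check the documentation for the actual
    target.

```python
LINE = 64  # assumed cache line size in bytes


def straddles_line(addr, size, line=LINE):
    """True if the bytes [addr, addr + size) cross a line boundary."""
    return addr // line != (addr + size - 1) // line


def padded_size(size, line=LINE):
    """Round size up to whole lines so neighbours never share a line."""
    return -(-size // line) * line
```

    For example, an 8-byte word at offset 0x3C within a line straddles
    the boundary, while the same word at offset 0x38 does not.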
  • From MitchAlsup@[email protected] to comp.arch on Thu May 28 01:27:36 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the bus lock
    and still make forward progress... Sigh... A horrible LL/SC thing can
    live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.


    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o


    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before you
    conclude something's fundamentally broken? For me the answer is: very few.

    Following a "SC failure" My 66000 provides a readable control register
    called 'WHY' which contains a number. Negative numbers represent kinds
    of failures {resource limit exceeded, time out, ...} while positive
    values indicate how far back in line your request is (measured by a
    resource which has unique system-wide visibility to ATOMIC-order).

    Thus, SW can use WHY to reach deeper into the Queue of pending work and
    select a unit that nobody else is going to go after on the next iteration.



  • From Paul Clayton@[email protected] to comp.arch on Sun May 31 21:32:14 2026
    From Newsgroup: comp.arch

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations. IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 LOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    seems unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    Again, I think this is more about "quality of
    implementation" (and Architectural guarantees about such) than
    about the interface at an instruction level.
  • From Paul Clayton@[email protected] to comp.arch on Sun May 31 23:26:39 2026
    From Newsgroup: comp.arch

    On 5/27/26 10:25 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    [snip]
    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?

    A more generic interface has some advantages.

    I already mentioned that old software that was developed when
    there was not an atomic ["expensive" operation] instruction
    could benefit from idiom recognition on new hardware. (An
    alternative to that would be patching or recompiling the
    software. While I prefer a more abstract software distribution
    format for its ability to avoid having to move things to
    Architecture and even potentially perform microarchitectural
    optimizations at non-instruction granularity, such seems
    unlikely to be common any time soon.)
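
    A minimal sketch of the idiom recognition discussed here, assuming a
    decoder that looks at a window of exactly three instructions. The
    Insn and AtomicAdd records and the function name are hypothetical,
    invented for illustration; a real SC also writes a success flag,
    which this model ignores.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    """Hypothetical instruction record used only for this sketch.

    LL : rd <- [ra]
    ADD: rd <- ra + imm
    SC : [ra] <- rd   (rd names the register holding the data to store)
    """
    op: str
    rd: str
    ra: str
    imm: int = 0


class AtomicAdd(NamedTuple):
    """Fused operation a core could export: [addr_reg] += imm."""
    addr_reg: str
    imm: int
    result_reg: str


def recognize_atomic_add(window: list) -> Optional[AtomicAdd]:
    """Return a fused atomic add if the window is exactly LL; ADD; SC."""
    if len(window) != 3:
        return None
    ll, add, sc = window
    if (ll.op, add.op, sc.op) != ("LL", "ADD", "SC"):
        return None
    # The ADD must consume the loaded value and the SC must store the sum.
    if add.ra != ll.rd or sc.rd != add.rd:
        return None
    # LL and SC must name the same address register ("same address pattern").
    if sc.ra != ll.ra:
        return None
    # The address register must not be overwritten inside the sequence.
    if ll.rd == ll.ra or add.rd == ll.ra:
        return None
    return AtomicAdd(addr_reg=ll.ra, imm=add.imm, result_reg=add.rd)
```

    With this model, "LL R10 <- [R9]; ADD R10 <- R10, #1; SC [R9] <- R10"
    is fused, while a sequence whose SC uses a different address register
    than its LL is left alone as a missed optimization.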

    Even with atomic instructions, the Architecture generally does
    not provide guarantees about scalability. I doubt any
    implementation would stop-the-world to perform an atomic
    operation (because the performance penalty would be quite
    noticeable), but I can easily imagine an implementation
    waiting until the atomic operation is not speculative before
    starting it.

    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized. (System calls
    have similarly excessive, in my opinion, latency. Some of this
    may be from cruft, but I received the impression that a lack of
    optimization effort is a significant cause of the higher
    latency.)

    I do not like the code bloat and decode complexity of using
    LL/SC for simple atomic operations. Unfortunately, even a
    LL-and-SC-after-next-compute instruction (which would allow
    arbitrary single compute instruction atomics and might be
    extended by function call instructions to microcode) would have
    the bloat of redundant register name encoding. Even a diversity
    of addressing modes may be excessive for atomic operations, if
    simple register-indirect with no offset is sufficiently common.

    With destructive operations (like x86), it would be possible to
    avoid the register name overhead by having the LL instruction
    not include a register name, taking it from the following
    compute instruction. For an LL instruction lacking a register
    name, if "microcode" calls were to be supported such call
    instructions would need to specify a register name (or use a
    defined, possibly function-specific ABI). An opcode-only LL
    might reasonably have space for hint/directive metadata, which
    might be useful.

    My objection to specific atomic instructions is mainly that
    they are specific. If an operation later becomes a reasonable
    target for such an instruction, a new instruction must be
    allocated to provide that operation. That new instruction would
    only be available to new software.

    For other operations, I am not certain what semantics make
    sense. If a read at one address changes the behavior of another
    access, does "atomic" behavior mean that the later in program
    order access happens before the I/O agent changes the access
    behavior or does it mean that the atomic action blocks "ordinary
    software agents" but lets side effects caused by the action
    occur in program order?

    Atomics ensure that the access is atomic with respect to
    all other accessors - ensuring that the other accessors
    will not see inconsistent data.

    I think I communicated poorly. I was thinking about what the
    appropriate behavior of an atomic add operation (however
    encoded) should be when targeting an address with side effects.
    The simple choice is "don't do that" (undefined behavior). The
    slightly more complex choice is fault on bad behavior.

    Yet one might argue that targeting such an address for an atomic
    operation could be useful in some particular context. Supporting
    such means making a choice of how the side effect is handled.

    (I am inclined to just have such accesses fault, but that needs to be
    defined as it means that acquiring a lock, performing a read,
    operating on the read value, writing the result, and releasing
    the lock is not functionally equivalent to an atomic operation.)

    Is the read side effect ignored? For side effects limited to the
    accessed address, this would seem to be the same as the side
    effect happening "between" the read and the write. For side
    effects with external effects, those would also be suppressed,
    making such different than having the side effect occur
    "between" the read and the write.

    Is the side effect done "between" the read and the write of the
    "atomic" operation? This would presumably overwrite the address-
    local side effect while producing other side effects, which
    might seem very strange as the side effect would use the old
    value for any value-dependent side effects.

    Is the side effect performed after the atomic operation? This
    could also be confusing.

    Even if the side effect does not change the value at the
    address, the value before or after the atomic operation might be
    used to determine what the side effect is.

    Removing side effects places atomics in a special category,
    which may be reasonable but is not a choice 100% obvious to
    everyone. Consistently and sensibly ordering side effects with
    atomics seems challenging.

    Such side effects are like atomic operations, which leads to a
    conflict. If the non-side effect operation is truly atomic, one
    might break the definition of the side effect.

    I would guess that each device would choose its supported
    behavior, but that would seem to add unnecessary complexity.
    Just faulting on such use seems sensible, but then one needs
    to distinguish between addresses that fault and addresses that
    allow atomic operations.

    I just looked it up. Power (version 2.06B), as an example,
    restricts Load And Reserve to coherent memory: "The storage
    location specified by the Load And Reserve and Store Conditional
    instructions must be in storage that is Memory Coherence
    Required if the location may be modified by another processor or
    mechanism. If the specified location is in storage that is Write
    Through Required or Caching Inhibited, the system data storage
    error handler or the system alignment error handler is invoked
    for the Server environment and may be invoked for the Embedded
    environment." I therefore suspect that even if such was
    extended to support PCI-E atomics, addresses with side effects
    would fault.
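
    A rough sketch of the rule in the quoted Power text, modeling only
    the Server-environment reading. The Attr flags mirror the attribute
    names in the quote; the function name and the three result strings
    are hypothetical labels chosen for this illustration, and the quote
    does not say what happens when the coherence requirement is
    violated, so that case is only flagged.

```python
from enum import Flag, auto


class Attr(Flag):
    """Storage control attributes named in the quoted Power text."""
    NONE = 0
    MEMORY_COHERENCE_REQUIRED = auto()
    WRITE_THROUGH_REQUIRED = auto()
    CACHING_INHIBITED = auto()


def classify_reservation_target(attrs: Attr, shared: bool = True) -> str:
    """Classify a Load And Reserve / Store Conditional target address.

    shared: True if another processor or mechanism may modify the location.
    """
    # Write Through Required or Caching Inhibited storage invokes an error
    # handler in the Server environment (modeled here as "fault").
    if attrs & (Attr.WRITE_THROUGH_REQUIRED | Attr.CACHING_INHIBITED):
        return "fault"
    # Shared locations must be Memory Coherence Required; the quote states
    # the requirement but not a consequence, so this label is an assumption.
    if shared and not (attrs & Attr.MEMORY_COHERENCE_REQUIRED):
        return "violates-requirement"
    return "ok"
```

    Under this reading, a device register mapped Caching Inhibited is
    rejected before any question of read side effects arises.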

    Atomics can be used as a basis (e.g. atomic test&set) to
    guard a critical section, but they're also useful for
    adjusting shared counters et alia.

    (There seem to be a lot of alia/other uses. Atomic OR seems like
    a useful means of supporting multiple "named" read locks; if
    implemented aggressively, atomic OR could even be used for
    bit-sized locks in combination with atomic AND.)
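
    A minimal sketch of bit-sized locks built from atomic OR and atomic
    AND. AtomicWord is a hypothetical software stand-in for a memory
    word; its internal threading.Lock only models the atomicity that
    hardware would provide and is not part of the technique itself.

```python
import threading


class AtomicWord:
    """Software stand-in for a memory word with atomic OR and AND."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # models hardware atomicity only

    def fetch_or(self, mask):
        """Atomically OR mask into the word; return the previous value."""
        with self._guard:
            old = self._value
            self._value = old | mask
            return old

    def fetch_and(self, mask):
        """Atomically AND mask into the word; return the previous value."""
        with self._guard:
            old = self._value
            self._value = old & mask
            return old


def try_lock_bit(word, bit):
    """Acquire the lock for one bit; True if this caller set the bit."""
    mask = 1 << bit
    return (word.fetch_or(mask) & mask) == 0


def unlock_bit(word, bit):
    """Release the lock for one bit by clearing it with atomic AND."""
    word.fetch_and(~(1 << bit))
```

    Each bit acts as an independent lock: a caller owns bit n only if
    the value returned by the OR shows that bit n was previously clear.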

    My perception is that PCI-E atomics are not meant for
    non-idempotent storage. (I do not know how ARM atomic
    instructions handle such cases.)

    See above.

    The "above" statement was not clear to me. An I/O device's
    read side effect does not play nicely with the concept of
    atomic. One could define the atomic not to actually "read"
    the device register (no side effect), but I think one
    cannot just say the operation is atomic.
  • From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:27:53 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 5/27/26 10:25 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    [snip]
    In the case of a simple operation, as has been stated before,
    the LL/SC sequence can be converted to the equivalent of an
    atomic instruction.

    If true in the general case (and I'm not sure I see how it
    can be), why bother to add the hardware to do so when
    atomics are generally superior, scalable, simpler to implement and
    higher performance?

    A more generic interface has some advantages.

    I already mentioned that old software that was developed when
    there was not an atomic ["expensive" operation] instruction
    could benefit from idiom recognition on new hardware. (An
    alternative to that would be patching or recompiling the
    software. While I prefer a more abstract software distribution
    format for its ability to avoid having to move things to
    Architecture and even potentially perform microarchitectural
    optimizations at non-instruction granularity, such seems
    unlikely to be common any time soon.)

    Even with atomic instructions, the Architecture generally does
    not provide guarantees about scalability. I doubt any
    implementation would stop-the-world to perform an atomic
    operation (because the performance penalty would be quite
    noticeable), but I can easily imagine an implementation
    waiting until the atomic operation is not speculative before
    starting it.

    Understand that LOCK XADD [...] to MMI/O does exactly this !

    But note: XADD [...] never causes more than necessary bus traffic
    and as an atomic event, never fails, never needs retry, ...
  • From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:38:51 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    Academic LL/SC: I can agree with this statement. But neither ASF nor
    ESM has problems making stronger guarantees--and I did this over
    {7 ASF, 8 ESM} cache lines not 1 single memory location. These also
    impose limitations on instruction order and SW has to understand
    several non-von Neumann properties of the ATOMIC event.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    That standard academic stuff cannot, does not mean it absolutely
    cannot be done.

    IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...
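
    A small sketch of the padding and alignment arithmetic this refers
    to. The granule size is a parameter because, as noted in this
    thread, it is implementation specific; the 64-byte value in the
    usage note is an assumption, and the function names are
    hypothetical.

```python
def granule_index(addr, granule_bytes):
    """Index of the reservation granule that contains addr."""
    return addr // granule_bytes


def shares_granule(a, b, granule_bytes):
    """True if two addresses fall in the same reservation granule."""
    return granule_index(a, granule_bytes) == granule_index(b, granule_bytes)


def padded_offsets(count, item_bytes, granule_bytes):
    """Offsets that start each item on its own granule boundary.

    The stride is item_bytes rounded up to a whole number of granules,
    so no two items ever share a granule.
    """
    stride = -(-item_bytes // granule_bytes) * granule_bytes  # ceiling
    return [i * stride for i in range(count)]
```

    Assuming a 64-byte granule, three 8-byte counters would be placed at
    offsets 0, 64 and 128, so a store to one counter cannot clear a
    reservation held on another.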

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 XLOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    is unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    It took me more than 35 years to learn how to write µArchitecture
    documents such that a malevolent engineer could not misunderstand
    what was written and specified. Try it, it is not easy. It is not
    something that can be taught, but it is something that diligence
    and perseverance can deliver.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    How many SC failures are acceptable if there are 1024 cores all
    going after the same lock ??
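
    To put a number on that question, here is a toy model of a plain
    LL/SC fetch-and-add under the worst lockstep schedule: every pending
    core does its LL, then every pending core tries its SC, and a
    successful SC clears all other reservations. The LLSCMemory class
    and the schedule are assumptions made for this sketch; real
    hardware may add backoff or arbitration that this model omits.

```python
class LLSCMemory:
    """One memory location with per-core reservations (toy model)."""

    def __init__(self):
        self.value = 0
        self.reserved = set()  # cores currently holding a reservation

    def ll(self, core):
        """Load-linked: read the value and record a reservation."""
        self.reserved.add(core)
        return self.value

    def sc(self, core, new):
        """Store-conditional: succeeds only if the reservation survived."""
        if core not in self.reserved:
            return False
        self.value = new
        self.reserved.clear()  # a successful store kills every reservation
        return True


def contended_fetch_add(n_cores):
    """Each core increments once; return (final value, total SC failures)."""
    mem = LLSCMemory()
    pending = list(range(n_cores))
    failures = 0
    while pending:
        loaded = {c: mem.ll(c) for c in pending}  # all cores LL in lockstep
        retry = []
        for c in pending:
            if not mem.sc(c, loaded[c] + 1):
                failures += 1
                retry.append(c)
        pending = retry
    return mem.value, failures
```

    In this schedule exactly one core succeeds per round, so n cores
    incur n(n-1)/2 failed SCs in total: 6 for 4 cores and 523776 for
    1024 cores, against zero retries for an exported atomic add.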

    Again, I think this is more concerned with "quality of
    implementation" (and Architectural guarantees about such) than
    with the interface at an instruction level.
  • From Paul Clayton@[email protected] to comp.arch on Tue Jun 2 14:42:12 2026
    From Newsgroup: comp.arch

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees, even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    Constrained transactions had these restrictions (from
    https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    | - All instructions within the transaction must be within 256
    | contiguous bytes of storage.
    | - The only branches you may use are relative branches that
    | branch forward (so there can be no loops).
    | - All SS and SSE-format instructions may not be used.
    | - Additional general instructions may not be used.
    | - The transaction's storage operands may not access more than
    | four octowords.
    | - The transaction may not access storage operands in any
    | 4 K-byte blocks that contain the 256 bytes of storage
    | beginning with the TBEGINC instruction.
    | - Operand references must be within a single doubleword,
    | except for some of the "multiple" instructions for which the
    | limitation is a single octoword.
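
    A sketch of how four of the listed restrictions could be checked
    against an executed instruction trace. The TxInsn record and the
    function are hypothetical, the 32-byte octoword is an assumption
    about z/Architecture terminology, and the remaining restrictions
    (instruction formats, the 4 K-byte exclusion, per-operand size) are
    not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

OCTOWORD = 32  # bytes; assumption: z/Architecture octoword size


@dataclass
class TxInsn:
    """One executed instruction of a candidate constrained transaction."""
    addr: int                                # instruction address
    length: int                              # instruction length in bytes
    branch_target: Optional[int] = None      # None if not a branch
    operands: List[Tuple[int, int]] = field(default_factory=list)  # (addr, size)


def constraint_violations(trace: List[TxInsn]) -> List[str]:
    """Return the modeled restrictions that the trace violates."""
    if not trace:
        return []
    problems = []
    if len(trace) > 32:
        problems.append("more than 32 instructions")
    start = trace[0].addr
    if any(i.addr < start or i.addr + i.length > start + 256 for i in trace):
        problems.append("instructions outside 256 contiguous bytes")
    if any(i.branch_target is not None and i.branch_target <= i.addr
           for i in trace):
        problems.append("non-forward branch")
    touched = set()  # distinct aligned octowords touched by storage operands
    for i in trace:
        for op_addr, size in i.operands:
            first = op_addr // OCTOWORD
            last = (op_addr + size - 1) // OCTOWORD
            touched.update(range(first, last + 1))
    if len(touched) > 4:
        problems.append("more than four octowords accessed")
    return problems
```

    An empty result only means that none of the four modeled
    restrictions was violated; it does not show that the hardware would
    accept the transaction.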

    I think I read that the first implementation made an optimistic
    attempt and later — I do not remember if multiple optimistic
    attempts were made — a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses may be very wrong about what is
    practical to guarantee as atomic.

    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.
  • From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 19:36:06 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    The core is going to package this instruction up and ship it
    across the interconnect as a fire-and-forget transaction.

    The interconnect is going to route the package towards either a
    cache having write permission or a control register.

    The cache or control register will perform the packaged calculation
    and optionally send back the previous value.

    The core receives the optional previous value and the memory-atomic
    is complete:: 2 interconnect messages, both smaller than a cache line,
    no cache lines are moved, and the calculation cannot fail. The only
    failure mode is if the interconnect message fails ECC check in either
    direction.
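
    The message flow in the four steps above can be sketched as a toy
    model that counts interconnect messages. The Owner and Interconnect
    classes and the fetch_add helper are hypothetical names for this
    illustration; routing, ECC and timing are not modeled.

```python
import operator


class Owner:
    """Cache with write permission (or a control register) holding the data."""

    def __init__(self):
        self.mem = {}

    def handle(self, op, addr, operand, want_old):
        """Perform the packaged calculation; optionally return the old value."""
        old = self.mem.get(addr, 0)
        self.mem[addr] = op(old, operand)
        return old if want_old else None


class Interconnect:
    """Carries the packaged atomic to the owner and counts messages."""

    def __init__(self, owner):
        self.owner = owner
        self.messages = 0

    def atomic(self, op, addr, operand, want_old=True):
        self.messages += 1  # request: core -> owner, smaller than a line
        reply = self.owner.handle(op, addr, operand, want_old)
        if want_old:
            self.messages += 1  # response: owner -> core with the old value
        return reply


def fetch_add(net, addr, amount, want_old=True):
    """Export an add to the owner; no cache line moves to the core."""
    return net.atomic(operator.add, addr, amount, want_old)
```

    In this model every fetch-and-add costs at most two messages
    regardless of how many cores contend, and there is no retry path
    because the owner applies requests one at a time.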

    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees,

    If so, you will be surprised when you implement one.

    even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Where it becomes cubically harder.

    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    SW TM wants the TM model to support an unbounded number of memory
    elements in the single transaction. HW does not do unbounded.

    Constrained transactions had these restrictions (from
    https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    I used a timer--to the same ends.
    | - All instructions within the transaction must be within 256
    | contiguous bytes of storage.
    I allow calls to subroutines in the event.
    | - The only branches you may use are relative branches that
    | branch forward (so there can be no loops).
    Loops are OK as long as the timer does not go off.
    | - All SS and SSE-format instructions may not be used.
    Agreed.
    | - Additional general instructions may not be used.
    I see no reason to limit general calculations and memory access.
    | - The transaction's storage operands may not access more than
    | four octowords.
    8 cache lines participate; an unbounded number of cache lines can
    be accessed as long as the number of participants is no larger
    than 8.
    | - The transaction may not access storage operands in any
    | 4 K-byte blocks that contain the 256 bytes of storage
    | beginning with the TBEGINC instruction.
    Interesting.
    | - Operand references must be within a single doubleword,
    | except for some of the "multiple" instructions for which the
    | limitation is a single octoword.
    Any normal memory references to the participating lines.

    I think I read that the first implementation made an optimistic
    attempt and later — I do not remember if multiple optimistic
    attempts were made — a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    If you take the necessary 6 months to slug through all issues
    you can find solutions for the disjoint participants to be at
    least as large as the outstanding Miss Buffer size (or MB-1).

    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    If you do it right, your architecture sets up failure paths,
    so that if failure happens, IP reverts to the failure point
    without executing a branch instruction. I have an instruction
    that samples 'interference' and changes the failure point as
    a necessary addition. Any interrupt or exception transfers
    control to failure point before performing exception control
    transfer.

    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    The thing that makes this so difficult is that most µArchitectures
    cannot guarantee that 2 cache lines are ever simultaneously present
    in the cache. ASF and ESM have means to do this which greatly
    strengthens the guarantee of forward progress.

    My 66000 includes priority in memory transactions, and this enables
    the cache with write permission to determine to allow the request
    or to fail the request (request is at equal or lower priority) thus
    allowing the higher priority ATOMIC event to make forward progress
    at the expense of the lower priority event.
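
    A minimal sketch of that priority rule as seen by the cache holding
    write permission. The LineOwner class, the "GRANT"/"NAK" strings and
    the convention that a larger number means higher priority are
    assumptions for illustration; this is not a description of the
    actual My 66000 protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineOwner:
    """Cache holding a line with write permission (toy model)."""
    event_priority: Optional[int] = None  # None: no ATOMIC event in flight

    def respond(self, request_priority: int) -> str:
        """Decide whether an incoming request for the line is allowed."""
        if self.event_priority is None:
            return "GRANT"  # nothing to protect
        if request_priority <= self.event_priority:
            return "NAK"    # equal or lower priority: the request fails
        # Higher-priority request wins; the local event gives up the line.
        self.event_priority = None
        return "GRANT"
```

    Because ties go to the current holder, two events of equal priority
    cannot keep stealing the line from each other.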

    At certain times the core may be in a position where it can finish
    an event if the cache lines can be guaranteed. During this period,
    a core can NaK a request so that the event is guaranteed to finish.

    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    It is much worse than that in practice. The interconnect protocol and
    the cache coherence model HAVE to HAVE ATOMIC event forward progress
    fully integrated. MESI and MOESI are insufficient here; most directory
    coherence protocols are also insufficient.

    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses may be very wrong about what is
    practical to guarantee as atomic.

    See Lamport...

    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.

    ...
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:52:39 2026
    From Newsgroup: comp.arch

    On 6/1/2026 6:38 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
    On 5/20/2026 4:47 PM, Paul Clayton wrote:
    On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
    CAS failures, I have tested this in the past, will hit the
    bus lock and still make forward progress... Sigh... A
    horrible LL/SC thing can live lock!

    LL/SC live lock is implementation dependent. One could
    Architecturally guarantee forward progress for the kind of cases
    where CAS would be an alternative.

    In my opinion, this is not so much a CAS vs. LL/SC issue as a
    quality of implementation issue.

    Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
    guarantees. Using LL/SC to emulate them is a different story.

    Academic LL/SC: I can agree with this statement. But neither ASF nor
    ESM has problems making stronger guarantees--and I did this over
    {7 ASF, 8 ESM} cache lines not 1 single memory location. These also
    impose limitations on instruction order and SW has to understand
    several non-von Neumann properties of the ATOMIC event.

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    That standard academic stuff cannot, does not mean it absolutely
    cannot be done.

    IBM's constrained
    transactions guaranteed success of a transaction if it met
    certain criteria. A single-instruction LL/SC body could be
    Architecturally guaranteed to perform not only successfully but
    with some performance characteristics.

    A guarantee of forward progress is not very useful if the
    progress is glacially (or cosmologically) slow. ("We guarantee
    that the operation will complete before the heat death of the
    universe"☺)

    A _guarantee_ of forward progress is ALWAYS important? Sorry for
    shouting. Shit. Knowing the size of the reservation granule is
    hyper important to help the software pad and align to remove any
    false sharing on said granule. No? But...

    I disagree. A guarantee that has a time scale beyond human
    civilization much less the lifetime of the hardware seems to
    have extremely little use. It may be reasonable to assume
    reasonable timescales for such guarantees, but a simple
    guarantee of eventual completion (if the system is kept
    operating) might be given if the profit motive seems sufficient.

    (I am not certain if even x86 XLOCK operations are absolutely
    guaranteed to complete in the presence of context switches. A
    hardware thread might always be interrupted while it is
    performing the operation and if the hardware does not delay
    interrupt handling until after the operation completes, then the
    operation may never complete. This may be so extraordinarily
    improbable that an undetected error in ECC-protected memory
    might be more likely, in which case it is not really important.)

    I think one really wants the time scale explicitly declared as
    well as information about the range of latency and causes. Even
    5ms latency can seem like forever.

    Here's where the deeper problem can rear its ugly head... Vendors
    often don't document it? Or they document it inconsistently
    across revisions? So even if you do everything right in
    principle, you're tuning against a number you had to dig out of
    a forum post or reverse engineer yourself. Scary! ;^o

    Ugh!

    Architecting a lot of such factors might help with documentation
    as Architecture is more stable than microarchitecture, but I do
    not think typical companies have the incentives for excellence
    in documentation. If the only consequence of mistakes in
    Architectural documentation is a few software developers
    grumbling, keeping even such stable documentation consistent and
    correct (and abiding by the old/existing Architectural contract)
    is unlikely to seem important. In fact, if the inability to
    optimize forces people to buy more (or more expensive) hardware,
    poor documentation can mean higher profits.

    It took me more than 35 years to learn how to write µArchitecture
    documents such that a malevolent engineer could not misunderstand
    what was written and specified. Try it, it is not easy. It is not
    something that can be taught, but it is something that diligence
    and perseverance can deliver.

    Of course, the temptation toward "good enough" (not so bad
    that one will lose too many customers) is a problem. I would
    expect documented guarantees of sufficient generality to have
    the cognitive load for software developers be acceptable. That
    such guarantees seem to be very rare is sad.

    How many SC failures on a fetch-and-add are acceptable before
    you conclude something's fundamentally broken? For me the answer
    is: very few.

    How many SC failures are acceptable if there are 1024 cores all
    going after the same lock ??

    Again, I think this is more concerned with "quality of
    implementation" (and Architectural guarantees about such) than
    with the interface at an instruction level.

    Simple... Do NOT allow 1024 cores to hammer a single location!

  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 14:15:24 2026
    From Newsgroup: comp.arch

    On 6/2/2026 12:36 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    The core is going to package this instruction up and ship it
    across the interconnect as a fire-and-forget transaction.

    The interconnect is going to route the package towards either a
    cache having write permission or a control register.

    The cache or control register will perform the packaged calculation
    and optionally send back the previous value.

    The core receives the optional previous value and the memory-atomic
    is complete:: 2 interconnect messages, both smaller than a cache line,
    no cache lines are moved, and the calculation cannot fail. The only
    failure mode is if the interconnect message fails ECC check in either
    direction.

    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees,

    If so, you will be surprised when you implement one.

    even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Where it becomes cubically harder.

    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    SW TM wants the TM model to support an unbounded number of memory
    elements in the single transaction. HW does not do unbounded.

    Constrained transactions had these restrictions (from
    https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    I used a timer--to the same ends.
    | - All instructions within the transaction must be within 256
    | contiguous bytes of storage.
    I allow calls to subroutines in the event.
    | - The only branches you may use are relative branches that
    | branch forward (so there can be no loops).
    Loops are OK as long as the timer does not go off.
    | - All SS and SSE-format instructions may not be used.
    Agreed.
    | - Additional general instructions may not be used.
    I see no reason to limit general calculations and memory access.
    | - The transaction's storage operands may not access more than
    | four octowords.
    8 cache lines participate, an unbounded number of cache lines
    can be accessed as long as participants is no larger than 8.
    | - The transaction may not access storage operands in any
    | 4K-byte blocks that contain the 256 bytes of storage beginning
    | with the TBEGINC instruction.
    Interesting.
    | - Operand references must be within a single doubleword,
    | except for some of the "multiple" instructions for which the
    | limitation is a single octoword.
    Any normal memory references to the participating lines.

    I think I read that the first implementation made an optimistic
    attempt and later — I do not remember if multiple optimistic
    attempts were made — a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    If you take the necessary 6 months to slog through all issues
    you can find solutions for the disjoint participants to be at
    least as large as the outstanding Miss Buffer size (or MB-1).

    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    If you do it right, your architecture sets up failure paths,
    so that if failure happens, IP reverts to the failure point
    without executing a branch instruction. I have an instruction
    that samples 'interference' and changes the failure point as
    a necessary addition. Any interrupt or exception transfers
    control to failure point before performing exception control
    transfer.

    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    The thing that makes this so difficult is that most µArchitectures
    cannot guarantee that 2 cache lines are ever simultaneously present
    in the cache. ASF and ESM have means to do this which greatly
    strengthens the guarantee of forward progress.

    My 66000 includes priority in memory transactions, and this enables
    the cache with write permission to determine to allow the request
    or to fail the request (request is at equal or lower priority) thus
    allowing the higher priority ATOMIC event to make forward progress
    at the expense of the lower priority event.

    At certain times the core may be in a position where it can finish
    an event if the cache lines can be guaranteed. During this period,
    a core can NaK a request so that the event is guaranteed to finish.

    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    It is much worse than that in practice. The interconnect protocol and
    the cache coherence model HAVE to HAVE ATOMIC event forward progress
    fully integrated. MESI and MOESI are insufficient here; most directory
    coherence protocols are also insufficient.

    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses are very wrong about what is
    practical to guarantee as atomic.

    See Lamport...

    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.

    ...

    Well, we can do something... we know that lock cmpxchg8b on a 32-bit
    system can handle two adjacent cache lines. So, we can try to hold more
    than that, but! it's not ideal. For instance, my multex can do it and
    emulate it. Read all of https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 14:20:44 2026
    From Newsgroup: comp.arch

    On 6/2/2026 2:15 PM, Chris M. Thomasson wrote:
    On 6/2/2026 12:36 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 6/1/26 9:27 PM, MitchAlsup wrote:
    [snip]
    But note: XADD [...] never causes more than necessary bus traffic

    I am skeptical that this is Architecturally guaranteed. It may
    fall out of any even semi-sane implementation, in which case
    programmers might be willing to take it as guaranteed. Yet I
    suspect "sanity" may not be reliable with changing tradeoffs
    (including whether protecting a company's reputation has value).

    The core is going to package this instruction up and ship it
    across the interconnect as a fire-and-forget transaction.

    The interconnect is going to route the package towards either a
    cache having write permission or a control register.

    The cache or control register will perform the packaged calculation
    and optionally send back the previous value.

    The core receives the optional previous value and the memory-atomic
    is complete:: 2 interconnect messages, both smaller than a cache line,
    no cache lines are moved, and the calculation cannot fail. The only
    failure mode is if the interconnect message fails ECC check in either
    direction.
    and as an atomic event, never fails, never needs retry, ...

    I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
    etc.) could provide such guarantees,

    If so, you will be surprised when you implement one.

                                          even extending to multiple
    contiguous instructions operating on data within an aligned
    64-byte region.

    Where it becomes cubically harder.
    Interestingly, it seems that IBM's z17 is the last
    implementation to support constrained transactions. I do wonder
    why this feature has been removed from the Architecture.

    SW TM wants the TM model to support an unbounded number of memory
    elements in the single transaction. HW does not do unbounded.

    Constrained transactions had these restrictions (from
    https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
    | - The transaction executes no more than 32 instructions.
    I used a timer--to the same ends.
    | - All instructions within the transaction must be within 256
    |   contiguous bytes of storage.
    I allow calls to subroutines in the event.
    | - The only branches you may use are relative branches that
    |   branch forward (so there can be no loops).
    Loops are OK as long as the timer does not go off.
    | - All SS and SSE-format instructions may not be used.
    Agreed.
    | -  Additional general instructions may not be used.
    I see no reason to limit general calculations and memory access.
    | - The transaction's storage operands may not access more than
    |   four octowords.
    8 cache lines participate, an unbounded number of cache lines
    can be accessed as long as participants is no larger than 8.
    | - The transaction may not access storage operands in any
    |   4K-byte blocks that contain the 256 bytes of storage beginning
    |   with the TBEGINC instruction.
    Interesting.
    | - Operand references must be within a single doubleword,
    |   except for some of the "multiple" instructions for which the
    |   limitation is a single octoword.
    Any normal memory references to the participating lines.

    I think I read that the first implementation made an optimistic
    attempt and later — I do not remember if multiple optimistic
    attempts were made — a hardware lock was used. Perhaps four
    addresses cause too much of a slowdown when there is conflict???

    I believe that guaranteeing completion would be substantially
    easier with only one aligned 64-byte region. (As I think I
    wrote before, adding a single "word" exportable atomic operation
    in a different "cache block" _might_ be practical to implement
    though I did not have an idea for how software would express such.
    I may be wrong that appending such an exportable operation would
    not make ensuring completion significantly more difficult.)

    If you take the necessary 6 months to slog through all issues
    you can find solutions for the disjoint participants to be at
    least as large as the outstanding Miss Buffer size (or MB-1).
    I think such guaranteed atomic sequences would require a
    distinct instruction not only to allow what IBM did (making such
    an illegal/faulting instruction) but also to fault when the
    instruction is misused since no fallback path is provided.

    If you do it right, your architecture sets up failure paths,
    so that if failure happens, IP reverts to the failure point
    without executing a branch instruction. I have an instruction
    that samples 'interference' and changes the failure point as
    a necessary addition. Any interrupt or exception transfers
    control to failure point before performing exception control
    transfer.
    There also seem to be other operations that would not (I think)
    be exceptionally difficult to guarantee. E.g., swapping cache
    blocks might not be much more difficult to guarantee than quick
    operations within a single cache block, though I do not know
    how useful such an unconditional swap would be. Atomic cache
    block copy would seem to be easier (it is similar to a block
    zeroing instruction except that the value is taken from a block
    that is not writeable by other agents being in exclusive or
    shared state). Guaranteeing atomicity for a copy into a cache
    block (where two contiguous cache blocks might be in the read
    set and the write is only to part of a cache block) seems a
    little more complicated.

    The thing that makes this so difficult is that most µArchitectures
    cannot guarantee that 2 cache lines are ever simultaneously present
    in the cache. ASF and ESM have means to do this which greatly
    strengthens the guarantee of forward progress.

    My 66000 includes priority in memory transactions, and this enables
    the cache with write permission to determine to allow the request
    or to fail the request (request is at equal or lower priority) thus
    allowing the higher priority ATOMIC event to make forward progress
    at the expense of the lower priority event.

    At certain times the core may be in a position where it can finish
    an event if the cache lines can be guaranteed. During this period,
    a core can NaK a request so that the event is guaranteed to finish.
    With conventional cache coherence, partial writes seem likely to
    be complex. If masked cache block updates were possible as an
    exportable atomic operation, it might be practical to lock (NAK-
    guard) a limited read set and push the update to the owner. I do
    not know if such an update independent of previous values in the
    written cache block would be useful.

    It is much worse than that in practice. The interconnect protocol and
    the cache coherence model HAVE to HAVE ATOMIC event forward progress
    fully integrated. MESI and MOESI are insufficient here; most directory
    coherence protocols are also insufficient.
    I am certainly not comfortable thinking about the visibility/
    ordering constraints, so my guesses are very wrong about what is
    practical to guarantee as atomic.

    See Lamport...
    Even if an operation can practically be guaranteed, it may not
    be worthwhile to provide an interface that allows requesting
    such a guaranteed atomic operation.

    ...

    Well, we can do something... we know that lock cmpxchg8b on a 32-bit
    system can handle two adjacent cache lines. So, we can try to hold more
    than that, but! it's not ideal. For instance, my multex can do it and
    emulate it. Read all of https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


    I think that is why AMD allowed for LOCK RMW along with LL/SC?!
  • From Andy Valencia@[email protected] to comp.arch on Tue Jun 2 17:11:11 2026
    From Newsgroup: comp.arch

    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    Now, there was no thought of hundreds (or thousands) of CPU's. But
    some of the pessimistic assumptions you might make of LL/SC (at least
    as available in MIPS CPU's of that era) might need to be
    revisited. Our best analysis said it would scale to very large
    (for that time) database workloads.

    Finances and other management things cancelled the program. Sequent
    eventually went with their NUMA, ultimately being acquired by IBM. We
    never found out how that system would've done in the real world.

    I seem to remember its code name was "Model R" (RISC).

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Jun 3 18:19:28 2026
    From Newsgroup: comp.arch

    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
        1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
        1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

    : bench-+!@
        1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
        1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

      !@   +!@
     7.5   7.3  not atomic
    14.2  13.2  atomic

    On a Xeon E-2388G (Rocket Lake):

      !@   +!@
     8.5   7.1  not atomic
    25.8  26.6  atomic

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 12:57:42 2026
    From Newsgroup: comp.arch

    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@ (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW. It's up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per-thread
    counter and summing them when we need to observe the actual count?
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Jun 3 20:53:49 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    Its up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 15:15:53 2026
    From Newsgroup: comp.arch

    On 6/3/2026 1:53 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    Its up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    Humm... I missed that. Anyway, you need to test them multi threaded...
    Say our counters are per thread so an increment adds to its per-thread
    counter instead of using a LOCK RMW. Then when the counter needs to be
    sampled we can start summing up the per thread counts...

  • From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 15:23:43 2026
    From Newsgroup: comp.arch

    On 6/3/2026 3:15 PM, Chris M. Thomasson wrote:
    On 6/3/2026 1:53 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
          1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
          1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
          1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
          1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

       !@   +!@
       7.5  7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

       !@   +!@
       8.5  7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    Its up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    Humm... I missed that. Anyway, you need to test them multi threaded...
    Say our counters are per thread so an increment adds to its per-thread
    counter instead of using a LOCK RMW. Then when the counter needs to be
    sampled we can start summing up the per thread counts...


    It can be amortized in different ways. Per thread is pretty damn lean
    and mean! ;^) Or we can have some tables of counters aligned and padded.
    So, a thread can increment its assigned counter instead of its
    per-thread count, or vice versa. But, the idea is to distribute things
    so a shit load of threads are not hammering a single location.

    It depends on the type of data or what the counters are being used for.
    We can read them using std::memory_order_relaxed loads.

    Thread 1: [ Counter A ] ---> Relaxed Increment (No LOCK)
    Thread 2: [ Counter B ] ---> Relaxed Increment (No LOCK)
    Thread 3: [ Counter C ] ---> Relaxed Increment (No LOCK)
                                        ^
    Sampling Thread: -------------------+ (Loops through with relaxed loads)
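
    The scheme in the diagram could be sketched in C++ roughly as
    follows. This is a sketch under stated assumptions, not a tuned
    implementation: the names PaddedCounter and DistributedCounter are
    hypothetical, the 64-byte line size is an assumption, each slot is
    assumed to have exactly one writer thread, and C++17 is assumed so
    that the over-aligned elements are allocated correctly.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One counter per (assumed 64-byte) cache line, so neighbouring slots
// do not share a line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class DistributedCounter {
public:
    explicit DistributedCounter(std::size_t slots) : slots_(slots) {}

    // Must be called only by the single thread that owns `slot`.
    // A relaxed load followed by a relaxed store: no atomic RMW, so no
    // LOCK-prefixed instruction is needed on x86.
    void increment(std::size_t slot) {
        assert(slot < slots_.size());
        std::atomic<std::uint64_t>& v = slots_[slot].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // Sampling thread: loops through all slots with relaxed loads.
    // While writers are running, the sum is an approximation of the
    // count, not a consistent snapshot.
    std::uint64_t sample() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<PaddedCounter> slots_;
};
```

    Each worker thread would be handed its own slot index and call
    increment(slot); any thread may call sample(). The trade-off is
    that a sample is only exact once the writers have stopped (for
    example after joining them).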
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu Jun 4 14:21:16 2026
    From Newsgroup: comp.arch

    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using Motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
    MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.
  • From EricP@[email protected] to comp.arch on Thu Jun 4 10:23:36 2026
    From Newsgroup: comp.arch

    On 2026-Jun-03 14:19, Anton Ertl wrote:
    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    Let's see:

    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@ (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    - anton

    On x86/x64 the exchange instruction XCHG with a memory operand has an
    implied LOCK prefix whether it is specified or not. In your example
    both are atomic. CMPXCHG does not do this - to be atomic it must have
    a LOCK prefix.



  • From EricP@[email protected] to comp.arch on Thu Jun 4 10:25:06 2026
    From Newsgroup: comp.arch

    On 2026-Jun-03 16:53, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 6/3/2026 11:19 AM, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;


    : bench-+!@
    1 5000000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

    !@ +!@
    7.5 7.3 not atomic
    14.2 13.2 atomic

    On a Xeon E-2388G (Rocket Lake):

    !@ +!@
    8.5 7.1 not atomic
    25.8 26.6 atomic

    Hammering a single location is going to be bad for LL/SC or LOCK RMW,
    regardless of the ins and outs of LL/SC vs LOCK RMW.

    It's two locations in these benchmarks: X and Y.

    Its up to the
    programmer to make sure that is amortized, distributed in clever ways.
    For instance, why use a single atomic counter, vs say using a per thread
    counter and summing them when we need to observe the actual count?

    These benchmarks use per-thread storage: They are single-threaded.

    - anton

    They might be allocated in the same cache line.
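
    Whether two variables land in the same cache line can be checked
    directly. A small C++ sketch, assuming a 64-byte line (the real
    size is implementation-specific); same_cache_line, Adjacent and
    Separated are hypothetical names used only for illustration:

```cpp
#include <cstdint>

// True if both addresses fall into the same line of `line` bytes.
// The 64-byte default is an assumption, not a guarantee.
bool same_cache_line(const void* a, const void* b,
                     std::uintptr_t line = 64) {
    return reinterpret_cast<std::uintptr_t>(a) / line ==
           reinterpret_cast<std::uintptr_t>(b) / line;
}

// Two cells laid out back to back, as consecutively allocated
// variables often are.
struct Adjacent {
    long x;
    long y;
};

// The same two cells forced onto separate (assumed) cache lines.
struct Separated {
    alignas(64) long x;
    alignas(64) long y;
};
```

    With the Separated layout x and y always sit in different 64-byte
    lines; with Adjacent they share one whenever the struct does not
    straddle a line boundary.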


  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Jun 4 21:04:28 2026
    From Newsgroup: comp.arch

    EricP <[email protected]> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.
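
    The same distinction can be written in C++ as a rough analogue of
    the two Forth words. plain_exchange and locked_exchange are
    hypothetical names; the claim that the atomic version becomes an
    XCHG with a memory operand on x86-64 describes what mainstream
    compilers usually emit, not something the language guarantees.

```cpp
#include <atomic>
#include <cstdint>

// Like the four-MOV "!@" sequence: a separate load and store.
// Another agent's write could land between the two.
std::int64_t plain_exchange(std::int64_t& cell, std::int64_t desired) {
    std::int64_t old = cell;
    cell = desired;
    return old;
}

// Like "atomic!@": one indivisible read-modify-write. On x86-64 this
// is typically a single XCHG, which is locked implicitly when one
// operand is in memory.
std::int64_t locked_exchange(std::atomic<std::int64_t>& cell,
                             std::int64_t desired) {
    return cell.exchange(desired, std::memory_order_seq_cst);
}
```

    Single-threaded, both return the old value and leave the new one in
    place; they differ only in what other agents can observe in
    between, and in cost.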

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 18:28:43 2026
    From Newsgroup: comp.arch

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
    MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 18:33:41 2026
    From Newsgroup: comp.arch

    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <[email protected]> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    How is your data organized? Show me the structure?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 21:20:20 2026
    From Newsgroup: comp.arch

    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <[email protected]> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
    1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov 0x8(%rbx),%r15
    mov %r13,%rax
    mov (%r15),%r13
    mov %rax,(%r15)

    while the code for "x atomic!@" is:

    mov %r13,(%r10)
    sub $0x8,%r10
    mov 0x8(%rbx),%r13
    mov 0x8(%r10),%rax
    add $0x8,%r10
    xchg %rax,0x0(%r13)
    mov %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    XCHG does have the implied LOCK as EricP mentioned.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 22:56:47 2026
    From Newsgroup: comp.arch

    On 6/4/2026 6:33 PM, Chris M. Thomasson wrote:
    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <[email protected]> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !

    : bench-!@
          1 5000000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
          1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
    ...
    On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
    whether it is specified or not. In your example both are atomic.

    The code for "x !@" is:

    mov    0x8(%rbx),%r15
    mov    %r13,%rax
    mov    (%r15),%r13
    mov    %rax,(%r15)

    while the code for "x atomic!@" is:

    mov    %r13,(%r10)
    sub    $0x8,%r10
    mov    0x8(%rbx),%r13
    mov    0x8(%r10),%rax
    add    $0x8,%r10
    xchg   %rax,0x0(%r13)
    mov    %rax,%r13

    As you can see, there is no XCHG in the !@ code.

    How is your data organized? Show me the structure?

    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...
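
    Roughly what I mean, as a compilable C11 sketch. The 64-byte line size
    is an assumption, and same_cache_line is just a made-up helper for
    checking the layout. Using alignas on the member makes the compiler do
    both jobs: it aligns the struct and pads its size up to the alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Aligned AND padded: sizeof is rounded up to the alignment. */
struct A {
    alignas(CACHE_LINE) unsigned long m_data;
};

struct B {
    alignas(CACHE_LINE) unsigned long m_data;
};

static_assert(sizeof(struct A) == CACHE_LINE, "A is padded to one line");
static_assert(alignof(struct A) == CACHE_LINE, "A starts on a line boundary");
static_assert(sizeof(struct B) == CACHE_LINE, "B is padded to one line");

struct A a;
struct B b;

/* Hypothetical helper: do two addresses fall into the same line? */
int same_cache_line(const void *p, const void *q)
{
    return ((uintptr_t)p / CACHE_LINE) == ((uintptr_t)q / CACHE_LINE);
}
```

    Padding alone is not enough: a padded but unaligned struct can still
    share its first or last line with a neighbor. For heap objects you
    would need an aligned allocator (e.g. aligned_alloc) as well.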
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 07:04:17 2026
    From Newsgroup: comp.arch

    [email protected] (Anton Ertl) writes:
    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.

    I have increased the loop count by a factor of 10, because I did not
    subtract the startup overhead of Gforth; as a result, the startup
    overhead is reduced from 3.3 cycles per execution of the relevant word
    to 0.33 cycles.

    I have also inserted 64 bytes between the variables, so that they are
    in different cache lines. This should not make a difference, because
    all accesses are in the same thread (i.e., no cache-ping-pong from
    possible false sharing), but just in case.

    What I did not do is to use several threads. The idea here is that
    programmers will take measures that ensure that contention is rare,
    but you still need to use atomic instructions and barriers to ensure
    correctness. Ideally in this case the atomic instructions and
    barriers have no extra cost, but in reality, they do have extra cost.
    If you are interested in seeing data for the contended case, look at
    the cache ping-pong benchmarks, e.g., on chipsandcheese. There is one
    danger in my approach: Hardware could have a special optimization for
    memory that is not shared between threads at all, and run slower if
    the memory is shared, but not contended; I have never read about such
    a mechanism, and I'll leave checking the performance with multiple
    non-contending threads for another day.

    The source code now is:

    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    : bench-!@
    1 50_000_000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

    : bench-+!@
    1 50_000_000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

    : bench-nobarrier
    50_000_000 0 do x @ y @ 2drop loop ;

    : bench-barrier
    50_000_000 0 do x @ barrier y @ barrier 2drop loop ;
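
    For readers who do not read Forth, here is a rough GNU C analogue of
    the benchmarked words. This is a sketch, not the Gforth source: the
    function names and the cell typedef are made up, the stack effects are
    inferred from the generated code shown earlier, and the use of
    __ATOMIC_SEQ_CST for the atomic variants is an assumption (only the
    barrier is stated to use it).

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t cell;   /* stand-in for a Forth cell */

/* !@ : plain exchange, a load followed by a store */
cell xchg_plain(cell *addr, cell new_val)
{
    cell old = *addr;
    *addr = new_val;
    return old;
}

/* atomic!@ : atomic exchange (XCHG with a memory operand on x86-64) */
cell xchg_atomic(cell *addr, cell new_val)
{
    return __atomic_exchange_n(addr, new_val, __ATOMIC_SEQ_CST);
}

/* +!@ : plain fetch-and-add, returns the old value */
cell fetch_add_plain(cell *addr, cell n)
{
    cell old = *addr;
    *addr = old + n;
    return old;
}

/* atomic+!@ : atomic fetch-and-add, returns the old value */
cell fetch_add_atomic(cell *addr, cell n)
{
    return __atomic_fetch_add(addr, n, __ATOMIC_SEQ_CST);
}

/* barrier : full memory barrier, as quoted above */
void barrier(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}
```

    In the exchange and fetch-and-add loops the value returned by the
    access to x is fed into the access to y, so successive operations form
    a dependency chain through the data.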

    The results are:

    Ryzen 8700G (Zen4):
        !@   +!@  barr
       2.4   2.4   1.8  no atomic/no barrier
       9.2   8.3   7.1  atomic/barrier

    Ryzen 3900X (Zen2; in contrast to the 8700G with 1 CCX, the 3900X has
    4 CCXs that may need coordination):
        !@   +!@  barr
       2.9   4.5   2.2  no atomic/no barrier
      19.1  19.0  17.5  atomic/barrier

    Given that the cycles here are far below the cycles reported for
    Inter-CCX cache ping-pong, I guess that there is no inter-CCX
    communication (at least no bidirectional one) in this benchmark.

    On to Intel:
    Core i3-1315U P-core (Golden Cove):
        !@   +!@  barr
       1.9   1.9   1.5  no atomic/no barrier
      19.4  20.9  27.9  atomic/barrier

    Core i3-1315U E-core (Gracemont):
        !@   +!@  barr
       2.7   2.2   2.2  no atomic/no barrier
      20.6  20.4  20.0  atomic/barrier

    On to Apple Silicon (weak memory ordering by default):
    Apple M1 P-core (Firestorm):
        !@   +!@  barr
       3.6   3.6   3.5  no atomic/no barrier
      31.9  31.5   3.6  atomic/barrier

    Apple M1 E-core (Icestorm):
        !@   +!@  barr
       3.4   3.4   3.4  no atomic/no barrier
      31.4  32.9   6.9  atomic/barrier

    On to ARM (weak memory ordering):
    RK3588 big (Cortex-A76):
        !@   +!@  barr
       3.3   3.6   3.3  no atomic/no barrier
      20.3  20.4  13.2  atomic/barrier

    RK3588 little (Cortex-A55):
        !@   +!@  barr
       7.2   9.2   7.2  no atomic/no barrier
      68.1  57.1  16.2  atomic/barrier

    I find the cheapness of the barrier on the M1 surprising. I would
    have expected that barriers are more expensive on hardware where the
    architecture allows more reordering and the hardware makes use of that
    license (and I think that the M1 does make use of it).

    OTOH, the atomic stuff is more expensive on the Apple M1 and the ARM
    cores than on the Intel and AMD cores (note that the cycle times of
    the Intel and AMD cores used here are quite a bit shorter than for the
    Apple and ARM cores, except for Gracemont compared to Firestorm; but
    for Firestorm the number of cycles executed is higher, so Gracemont
    still takes less time).

    In conclusion, as long as we have no contention, atomic accesses and
    barriers do not cost hundreds of cycles, but they do cost enough extra
    (except the barrier on Firestorm, at least in the present benchmark)
    that one does not want to use them across the board, only when
    accessing memory that another thread accesses, too. At least in this
    sample of cores, the atomic instructions are faster on Intel and AMD
    cores than on Apple and ARM cores; for the barrier, the costs are
    usually not higher and sometimes significantly lower than for the
    atomic instructions.

    - anton

    On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
    (fetch-and-add) costs the following numbers of cycles (including
    overhead):

        !@   +!@
       7.5   7.3  not atomic
      14.2  13.2  atomic

    On a Xeon E-2388G (Rocket Lake):

        !@   +!@
       8.5   7.1  not atomic
      25.8  26.6  atomic

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:04:51 2026
    From Newsgroup: comp.arch

    [email protected] (Scott Lurndal) writes:
    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC.

    I remember listening to a presentation by a student of a colleague
    about implementing garbage collection for IIRC big SGI machines. In
    addition to LL/SC, they had atomic stuff such as fetch-and-add
    implemented in the memory subsystem, not in the processor, and that
    apparently was needed for contended cases to avoid the round-trip time
    through the caches of individual processors. My understanding is
    that, while viewed from the perspective of an individual core, the
    atomic instructions were slow, the throughput in the contended case
    was significantly higher than with LL/SC or an atomic mechanism
    implemented in the individual CPUs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:12:03 2026
    From Newsgroup: comp.arch

    EricP <[email protected]> writes:
    These benchmarks use per-thread storage: They are single-threaded.
    ...
    They might be allocated in the same cache line.

    Given that they are accessed by the same thread, I don't expect that
    to hurt, but I did separate the variables by at least 64 bytes in my
    recent runs just in case.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:14:29 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/4/2026 2:04 PM, Anton Ertl wrote:
    EricP <[email protected]> writes:
    On 2026-Jun-03 14:19, Anton Ertl wrote:
    variable x 1 x !
    variable y -1 y !
    ...
    How is your data organized? Show me the structure?

    Shown above. Or, in today's testing:

    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 10:20:30 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

        !@   +!@  barr
       2.4   2.4   1.8  A B C
       2.4   2.4   1.9  D E

    For the atomic/barrier variants the cycles are:

        !@   +!@  barr
       9.3   8.3  7.2       A
       9.2   8.3  7.1       B
       9.2   8.3  8.5-11.2  C
       9.3   8.3  9.1-11    D
       9.1   8.3  7.3-11    E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that. The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Fri Jun 5 13:43:11 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
    MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
    SPP. After evaluation, we chose Pentium Pro to build the system
    (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.

    Both technical and business reasons.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Jun 5 17:02:24 2026
    From Newsgroup: comp.arch

    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.
    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The one processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    The SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors is the block load and store
    operations: they behave as RMO regardless of the processor's
    global memory mode.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Fri Jun 5 07:07:07 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    I don't recall the details of the MIPS evaluation, but we were concerned
    at the time about the scalability of LL/SC. SPARC never made it out
    of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was pretty
    fast for certain worksets and algorithms. RMO mode.

    Sun came through Cisco as well, I don't recall which generation of
    chips, but I remember their focus was on the interface to memory
    itself, targeting radically reduced latency and much higher bandwidth.
    We weren't sure they would get their design out the door, and we were
    pretty sure indeed that they wouldn't make a good enough embedded
    CPU for our purposes. Too big, too hot, too expensive, and so forth.

    At that time (MANY years ago now) Cisco's core router OS was big endian
    only. That kept us from considering x86.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 14:57:46 2026
    From Newsgroup: comp.arch

    On 6/5/2026 2:12 AM, Anton Ertl wrote:
    EricP <[email protected]> writes:
    These benchmarks use per-thread storage: They are single-threaded.
    ...
    They might be allocated in the same cache line.

    Given that they are accessed by the same thread, I don't expect that
    to hurt, but I did separate the variables by at least 64 bytes in my
    recent runs just in case.

    Make sure to pad and align the variables on separate cache lines. :^)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:11:22 2026
    From Newsgroup: comp.arch

    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    // padded to a l2 cache line
    struct A
    {
    unsigned word m_data;
    char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
    unsigned word m_data;
    char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

        !@   +!@  barr
       2.4   2.4   1.8  A B C
       2.4   2.4   1.9  D E

    For the atomic/barrier variants the cycles are:

        !@   +!@  barr
       9.3   8.3  7.2       A
       9.2   8.3  7.1       B
       9.2   8.3  8.5-11.2  C
       9.3   8.3  9.1-11    D
       9.1   8.3  7.3-11    E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that. The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Well, it's mainly for false sharing in a multithreaded environment. But
    it does matter a bit. If your variables straddle a cache line then a
    LOCK-prefixed access to them will trigger a bus lock.

    Single-threaded: avoid straddling cache line boundaries to prevent bus
    locks on LOCK-prefixed instructions.

    Multi-threaded: pad and align to prevent false sharing between
    independently accessed variables.

    For instance you don't want a mutex word to false share with say an
    atomic counter that has nothing to do with the mutex. They need to be
    padded and aligned...
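
    A small C sketch of the straddling check, assuming 64-byte lines;
    straddles_cache_line is a made-up helper name, not something from any
    library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Returns nonzero if the `size`-byte object starting at `addr` crosses a
   cache line boundary, i.e. its first and last byte are in different
   lines. */
int straddles_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```

    Note that a naturally aligned 8-byte word can never straddle a 64-byte
    line, because 64 is a multiple of 8. So the straddling case only
    arises for misaligned or packed data; padding and aligning to the line
    size is about false sharing.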


    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:27:04 2026
    From Newsgroup: comp.arch

    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    [email protected] (Anton Ertl) writes:
    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.

    I have increased the loop count by a factor of 10, because I did not
    subtract the startup overhead of Gforth; as a result, the startup
    overhead is reduced from 3.3 cycles per execution of the relevant word
    to 0.33 cycles.

    I have also inserted 64 bytes between the variables, so that they are
    in different cache lines. This should not make a difference, because
    all accesses are in the same thread (i.e., no cache-ping-pong from
    possible false sharing), but just in case.

    What I did not do is to use several threads. The idea here is that
    programmers will take measures that ensure that contention is rare,
    but you still need to use atomic instructions and barriers to ensure
    correctness. Ideally in this case the atomic instructions and
    barriers have no extra cost, but in reality, they do have extra cost.

    Indeed.


    [snip results]


    Thanks.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:40:13 2026
    From Newsgroup: comp.arch

    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    [email protected] (Anton Ertl) writes:
    Paul Clayton <[email protected]> writes:
    I seem to recall reading that x86's LOCK instructions take
    hundreds of cycles. While some of this is probably from stronger
    memory ordering guarantees, I get the impression that the
    operation itself is not aggressively optimized.

    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.
    [...]

    On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
    per thread stack location? Iirc some compilers would use a dummy. Oh
    shit man, 20+ish years ago I was running all sorts of benchmarks on
    MFENCE vs LOCK RMW. Or MFENCE vs MEMBAR #StoreLoad | #LoadStore |
    #StoreStore | #LoadLoad on the SPARC. I could not really directly test
    LOCK RMW wrt x86 on the SPARC because all of the SPARC's atomic RMWs are
    naked. I would have to manually add the barriers to make it TSO in RMO
    mode.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:43:14 2026
    From Newsgroup: comp.arch

    On 6/5/2026 3:11 PM, Chris M. Thomasson wrote:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    // padded to a l2 cache line
    struct A
    {
         unsigned word m_data;
         char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
         unsigned word m_data;
         char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?

    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today).  A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).

    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

       !@   +!@ barr
       2.4  2.4  1.8 A B C
       2.4  2.4  1.9 D E

    For the atomic/barrier variants the cycles are:

       !@   +!@ barr
       9.3  8.3  7.2 A
       9.2  8.3  7.1 B
       9.2  8.3  8.5-11.2 C
       9.3  8.3  9.1-11   D
       9.1  8.3  7.3-11   E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that.  The other columns show only small variations.
    In any case the aligning and padding recommended by you is not
    superior to the original code, which just uses two variables.

    Well, it's mainly for false sharing in a multithreaded environment. But
    it does matter a bit. If your variables straddle a cache line then it
    will trigger a bus lock. Single-threaded: avoid straddling cache line
    boundaries to prevent bus locks on LOCK-prefixed instructions.

    Actually try to avoid LOCK prefixed anything on single threaded... Even
    XCHG has that implied LOCK prefix. :^)



    Multi-threaded: pad and align to prevent false sharing between
    independently accessed variables.

    For instance you don't want a mutex word to false share with say an
    atomic counter that has nothing to do with the mutex. They need to be
    padded and aligned...


    Here's the code:

    1 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
         : cache-align here dup 64 naligned >align ;
         cache-align
         here 1 , cache-align here -1 , constant y constant x
    [endif]

    The part before the [else] is A, comment out "64 allot" for B.

    The part after the [else] is D, delete the second CACHE-ALIGN for C,
    and replace it with "64 allot" for E.




    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:06:43 2026
    From Newsgroup: comp.arch

    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.
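
    The store-then-load case can be sketched in C11 as follows. This is an
    illustration only, not code from the thread: flag, try_enter and leave
    are hypothetical names for a Dekker-style handshake between two threads.
    The seq_cst fence stands in for the explicit membar (membar #StoreLoad
    on SPARC V9).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags for a two-thread Dekker-style handshake. */
static atomic_int flag[2];

/* Announce intent, then check the other thread's flag.
   Returns true if the other thread has not announced itself. */
bool try_enter(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&flag[me], 1, memory_order_relaxed);
    /* Store->load ordering: without this fence the load below may be
       satisfied while the store still sits in the store buffer, which
       TSO permits. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag[1 - me], memory_order_relaxed) == 0;
}

/* Withdraw the announcement made by try_enter. */
void leave(int me)
{
    assert(me == 0 || me == 1);
    atomic_store_explicit(&flag[me], 0, memory_order_release);
}
```

    A caller whose try_enter fails would call leave and retry; the back-off
    policy is left out of this sketch.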

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:08:46 2026
    From Newsgroup: comp.arch

    On 6/5/2026 4:06 PM, Chris M. Thomasson wrote:
    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS.  We looked at LL/SC really, really hard.  Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe;  we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110.   So we
    investigated MIPS, SPARC and Pentium Pro.  Our target was for a 64+
    processor SPP.  After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC.   SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed
    up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?



    I would always program the SPARC (ASM using GAS) using the correct
    membars in the right places, even if in certain modes they would be
    no-ops.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:17:05 2026
    From Newsgroup: comp.arch

    On 6/5/2026 4:06 PM, Chris M. Thomasson wrote:
    [...]

    Fwiw, the SunFire T2000 was the first SPARC box I owned personally. Sun
    gave me one in their CoolThreads contest for my vzoom project. I
    have used others before that, but they were not mine.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Jun 6 01:44:09 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <[email protected]> posted:

    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

    SPARC used nullification in delay slots.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:14:17 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/5/2026 12:04 AM, Anton Ertl wrote:
    [email protected] (Anton Ertl) writes:
    I have revised the benchmarks as follows: I have added a test of a
    memory barrier, which is implemented in GNU C as

    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    The barriers separate loads.
    [...]

    On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
    per thread stack location?

    On AMD64, the latter. The code generated by gcc for the line above
    is:

    lock orq $0x0,(%rsp)

    On ARM A64 gcc generates the following:

    dmb ish
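
    For reference, a small GNU C sketch of what "the barriers separate
    loads" looks like at the source level. The function name
    load_pair_fenced is an assumption, and this is not the benchmark's
    actual implementation; it only mirrors the shape of a loop body that
    loads x, fences, loads y, and fences again, using the fence quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Load two words with a full fence after each load.
   Uses the GCC/Clang __atomic builtins, so this is GNU C, not ISO C. */
long load_pair_fenced(long *x, long *y)
{
    assert(x != NULL && y != NULL);
    long a = __atomic_load_n(x, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* full barrier after first load */
    long b = __atomic_load_n(y, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* full barrier after second load */
    return a + b;
}
```

    Compiling this with gcc -O2 -S and looking for the instructions quoted
    above is one way to check what a given compiler version emits for the
    fence.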

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:30:45 2026
    From Newsgroup: comp.arch

    [email protected] (Anton Ertl) writes:
    Anyway, let's see if it makes a difference.

    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).
    [...]
    And here are the results (on a Ryzen 8700G):

    The cycles per execution of the relevant word for the
    no-atomic/no-barrier variants are:

    !@    +!@   barr
    2.4   2.4   1.8        A B C
    2.4   2.4   1.9        D E

    For the atomic/barrier variants the cycles are:

    !@    +!@   barr
    9.3   8.3   7.2        A
    9.2   8.3   7.1        B
    9.2   8.3   8.5-11.2   C
    9.3   8.3   9.1-11     D
    9.1   8.3   7.3-11     E

    The variations for the barrier column are small for A and B (in the
    range 6.9-7.2), and quite a bit larger for C-E, and I have no
    explanation for that.

    Now I have: It's the placement of the native code. If I compile
    another definition

    : dummy1 swap over 2rot ;

    that is never called before all the others, the result for D becomes:

    !@    +!@   barr
    9.3   8.3   7.2        D

    with little variation. So it seems that the code placement of the
    bench-barrier word ran into some microarchitectural hiccup of Zen4.

    Now that I have that problem worked around, let's see if the data
    placement makes a difference:

    !@    +!@   barr
    9.3   8.3   7.2        A
    9.2   8.3   7.1        B
    9.3   8.3   7.0        C
    9.3   8.3   7.2        D
    9.3   8.3   7.2        E

    Making them adjacent in the same cache line is not a disadvantage as
    long as there is no actual communication going on. Of course, in an
    actual application you want them in different cache lines, because
    then you will have communication; otherwise using atomic accesses or
    barriers would be pointless.
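
    The placement variants C, D and E can be checked with a little
    arithmetic. A sketch in C, assuming 64-byte cache lines and 8-byte
    words; the struct and function names are hypothetical. A and B are left
    out because the size of the variable metadata is not given here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define CELL 8          /* assumed size of one word in bytes */

static_assert(CACHE_LINE % CELL == 0, "words must tile a cache line");

/* Hypothetical description of one data placement: byte offset of the
   first word from a cache-line boundary, and the padding after it. */
struct layout {
    size_t first;
    size_t padding;
};

/* Byte offset of the second word. */
size_t second_offset(struct layout l)
{
    return l.first + CELL + l.padding;
}

/* True if both words land in the same cache line. */
bool same_line(struct layout l)
{
    return l.first / CACHE_LINE == second_offset(l) / CACHE_LINE;
}

const struct layout variant_c = { .first = 0, .padding = 0 };
const struct layout variant_d = { .first = 0, .padding = 56 };
const struct layout variant_e = { .first = 0, .padding = 64 };
```

    Under these assumptions only C puts both words in one line; D puts the
    second word at the start of the next line, and E puts it 8 bytes into
    the next line.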

    Code (with the data part set up for E):

    0 [if]
    variable x 1 x !
    64 allot \ make sure the variables are in different cache lines
    variable y -1 y !

    [else]
    : dummy1 swap over 2rot ;
    : cache-align here dup 64 naligned >align ;
    cache-align
    here 1 , ( cache-align ) 64 allot here -1 , constant y constant x
    [endif]

    : bench-!@
    1 50_000_000 0 do x !@ y !@ loop drop ;

    : bench-atomic!@
    1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

    : bench-+!@
    1 50_000_000 0 do x +!@ y +!@ loop drop ;

    : bench-atomic+!@
    1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

    : bench-nobarrier
    50_000_000 0 do x @ y @ 2drop loop ;

    : bench-barrier
    50_000_000 0 do x @ barrier y @ barrier 2drop loop ;

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:49:06 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <[email protected]> writes:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    // padded to a l2 cache line
    struct A
    {
        unsigned word m_data;
        char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
        unsigned word m_data;
        char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?
    [...]
    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).
    ...
    Well, it's mainly for false sharing in a multi-threading environment. But
    it does matter a bit. If your variables straddle a cache line then it
    will trigger a bus lock.

    All of the data placement variants use word-aligned words and thus do
    not straddle cache lines. But your claim was that one should use only
    the first word in a cache line. Avoiding false sharing is important,
    if there is any sharing, but that's not the case for this benchmark.
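
    Why word-aligned words cannot straddle, as a sketch in C assuming
    64-byte cache lines; straddles_line is a hypothetical helper, not
    something from the benchmark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* True if the size-byte object starting at addr crosses a cache-line
   boundary, i.e. its first and last bytes lie in different lines. */
bool straddles_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}
```

    An 8-byte word at an address divisible by 8 starts at an in-line offset
    of at most 56, so its last byte is at offset 63 or lower and the check
    is always false. An 8-byte word at offset 60 would cross into the next
    line.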

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 11:25:17 2026
    From Newsgroup: comp.arch

    On 6/5/2026 6:44 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

    SPARC used nullification in delay slots.


    Iirc, might be wrong here, a MEMBAR can force processor serialization or
    stall the pipeline until the store buffers drain, so executing it right
    when the processor is updating the PC and nPC for a branch created nasty
    timing hazards? God, it's been a long time since I read the docs...
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 11:52:09 2026
    From Newsgroup: comp.arch

    On 6/6/2026 1:49 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 6/5/2026 3:20 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    // padded to a l2 cache line
    struct A
    {
        unsigned word m_data;
        char padding[...];
    };

    // padded to a l2 cache line
    struct B
    {
        unsigned word m_data;
        char padding[...];
    };


    Where A and B are both aligned up to a l2 cache line boundary? We need
    to pad _and_ align...

    Why would alignment to cache-line boundaries be necessary?
    [...]
    A) Word-aligned variable, 64 byte padding, another word-aligned
    variable (what I measured and posted today). A variable takes space
    not just for the data (one word), but also for the metadata (and the
    metadata is adjacent to the data).

    B) Word-aligned variables, no padding, word-aligned variable, with the
    two data words maybe in the same cache line, maybe not (measured
    yesterday).

    C) Cache-line-aligned word, no padding, another cache-line-aligned
    word (i.e., both words in the same cache line).

    D) Cache-line-aligned word, (56 bytes of) padding, another
    cache-line-aligned word.

    E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
    second word is aligned like in C).

    F) Word at offset 8 from a cache-line start, 48 bytes padding, another
    word (cache-line-aligned).
    ...
    Well, it's mainly for false sharing in a multi-threading environment. But
    it does matter a bit. If your variables straddle a cache line then it
    will trigger a bus lock.

    All of the data placement variants use word-aligned words and thus do
    not straddle cache lines. But your claim was that one should use only
    the first word in a cache line. Avoiding false sharing is important,
    if there is any sharing, but that's not the case for this benchmark.

    Fair enough! :^) For a single-threaded benchmark with no concurrent
    sharing, you are right. The layout variants you described ensure no
    single word straddles a cache-line boundary, which completely avoids the
    split-access or bus-lock penalty on a single core. In that specific
    context, packing things tightly is "superior" because my defensive
    padding would just bloat the working set and cause unnecessary cache
    misses.

    Fwiw, my advice to align and pad so a variable exclusively owns the
    first word of a cache line is a habit born entirely out of
    multi-threaded, lock/wait-free architecture design.

    Actually, there is a fundamental difference in intent:

    #1 Word Alignment: Keeps a single thread from split-access penalties
    (straddling). No word from cache line A bleeding into cache line B.

    #2 Cache-Line Alignment + Padding: Keeps different threads on different
    cores from causing hardware cache-coherence storms (false sharing). Very
    bad!

    If struct A and struct B live in the exact same cache line, they are
    safe from straddling. But the moment Core 0 writes to A and Core 1
    writes to B, the underlying MESI cache-coherence protocol will violently
    bounce that single cache line back and forth between L1 caches.

    Since your benchmark doesn't have concurrent sharing, you only care
    about #1. I default to engineering for #2 defensively because the moment
    code scales out to multiple threads, a well-aligned but unpadded
    structure can cause performance to fall off a cliff.
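
    A sketch of the defensive pad-and-align layout for per-thread data, in
    C11 and assuming 64-byte cache lines: one counter slot per thread, each
    owning a full line, so that writes from different cores never touch the
    same line. slot, slots, bump, total, slot_line and MAX_THREADS are
    hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64    /* assumed cache-line size in bytes */
#define MAX_THREADS 4    /* hypothetical thread count */

/* One counter per thread; alignas rounds the struct up to a full line. */
struct slot {
    alignas(CACHE_LINE) atomic_ulong count;
};

static_assert(sizeof(struct slot) == CACHE_LINE, "one slot per cache line");

static struct slot slots[MAX_THREADS];

/* Increment the calling thread's own slot. */
void bump(unsigned tid)
{
    assert(tid < MAX_THREADS);
    atomic_fetch_add_explicit(&slots[tid].count, 1, memory_order_relaxed);
}

/* Sum all slots; readers pay the cross-core traffic, writers do not. */
unsigned long total(void)
{
    unsigned long sum = 0;
    for (unsigned i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&slots[i].count, memory_order_relaxed);
    return sum;
}

/* Cache-line index of a slot, for checking the layout. */
uintptr_t slot_line(unsigned tid)
{
    assert(tid < MAX_THREADS);
    return (uintptr_t)&slots[tid] / CACHE_LINE;
}
```

    The cost is the one mentioned above: four counters occupy 256 bytes
    instead of 32, which only pays off once several cores really write
    concurrently.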

    Actually, do you remember the thread offset fiasco from Intel? I
    remember reading a white paper wrt hyper threading, that the thread
    stacks should be offset from each other to avoid false sharing. It was a
    workaround for a design error, iirc?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 12:03:46 2026
    From Newsgroup: comp.arch

    On 6/6/2026 11:25 AM, Chris M. Thomasson wrote:
    On 6/5/2026 6:44 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes
    mixed up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

    SPARC used nullification in delay slots.


    Iirc, might be wrong here, a MEMBAR can force processor serialization or
    stall the pipeline until the store buffers drain, so executing it right
    when the processor is updating the PC and nPC for a branch created nasty
    timing hazards? God, it's been a long time since I read the docs...

    Or iirc, sometimes in certain use cases, the branch delay slot might not
    be executed? Even when programming it directly in ASM and using GAS to
    assemble it?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 12:08:33 2026
    From Newsgroup: comp.arch

    On 6/6/2026 12:03 PM, Chris M. Thomasson wrote:
    On 6/6/2026 11:25 AM, Chris M. Thomasson wrote:
    On 6/5/2026 6:44 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 6/5/2026 7:02 AM, Michael S wrote:
    On Thu, 4 Jun 2026 18:28:43 -0700
    "Chris M. Thomasson" <[email protected]> wrote:

    On 6/4/2026 7:21 AM, Scott Lurndal wrote:
    Andy Valencia <[email protected]> writes:
    I do not think it is impossible for an architecture to make
    guarantees about LL/SC operations.

    I was at Sequent when we were really serious about moving off Intel
    onto MIPS. We looked at LL/SC really, really hard. Lock traces
    from current systems, SW simulations, down to gate-level
    simulations.
    We ended up being sufficiently confident (as in, bet the program,
    by implication bet the company) that it would work as efficiently
    as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
    that it was very likely to scale without undue incremental design
    work to ~32 CPU's.

    I was at Unisys in that same timeframe; we had planned on building
    the SPP (scalable parallel processor aka OPUS) using motorola 88110
    CPUs, until Apple went PPC and Moto canceled 88110. So we
    investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
    processor SPP. After evaluation, we chose Pentium Pro to build the
    system (using the Intel Paragon backplane).

    I don't recall the details of the MIPS evaluation, but we were
    concerned at the time about the scalability of LL/SC. SPARC never
    made it out of the first evaluation round.

    Why? I had a SunFire T2000 that, when programmed correctly, was
    pretty fast for certain worksets and algorithms. RMO mode.

    RMO mode?
    I am pretty sure that T2000 had no RMO mode.

    If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
    were UltraSPARC and UltraSPARC II.

    Oh shit, I think you are right! I sometimes get my old SPARC boxes
    mixed up.

    Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
    defines three memory models: TSO, PSO, and RMO.

    It still needed an explicit membar for a store followed by a load to
    another location, even in TSO.

    Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?


    Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
    to be TSO-only. The processor for which I didn't find a definite
    statement is the original UltraSPARC III (Cheetah), but I would be very
    surprised if it is not the same as UltraSPARC III Cu.

    SPARC-T line (originally named Niagara) was TSO-only from the very
    start.
    The only remnant of RMO in these processors are Block load and store
    operations - they behave as RMO regardless of processor's
    global memory mode.

    Remember that old thing in one of the SPARC docs that explicitly
    mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

    SPARC used nullification in delay slots.


    Iirc, might be wrong here, a MEMBAR can force processor serialization or
    stall the pipeline until the store buffers drain, so executing it right
    when the processor is updating the PC and nPC for a branch created nasty
    timing hazards? God, it's been a long time since I read the docs...

    Or iirc, sometimes in certain use cases, the branch delay slot might not
    be executed? Even when programming it directly in ASM and using GAS to
    assemble it?

    Hyper dangerous case. If a MEMBAR instruction is "skipped", then another
    one bites the dust! Memory racer!

    Fwiw, some tech relief, a song to go with it:

    (Queen - Another One Bites The Dust (Official Video))

    https://youtu.be/eqyUAtzS_6M?list=RDeqyUAtzS_6M

    ;^D

    Memory race... A song for it.. rofl!

    (Charli XCX - Speed Drive (From Barbie The Album) [Official Audio])

    https://youtu.be/TxZwCpgxttQ?list=RDTxZwCpgxttQ

    Sorry, just a brain coolant. ;^)

    --- Synchronet 3.22a-Linux NewsLink 1.2