Forum: War Ensemble BBS

Re: ARM CAS vs LL/SC

From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:24:12 2026

From Newsgroup: comp.arch

On 5/13/26 9:48 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

On 5/11/26 10:38 AM, Scott Lurndal wrote:

[snip]

For HW to 'recognize' an LL-OP-SC as an idiom, it would have to be
'local'. 'Local' probably means that few-to-no instructions that are
not part of the LL-OP-SC sequence appear within it (with a strong
preference for none).

That is: HW can recognize:

LDDL R9,[IP,#48]
ADD R10,R9,#1
STDL R10,[IP,#48] // same address "pattern"

as an idiom fairly easily, but not:

MOV R8,#48
MOV R6,#48
...
LDDL R9,[IP,R6]
ADD R10,R9,#1
STDL R10,[IP,R8] // same address "pattern" ???

Said recognized idiom can be packaged up and shipped out through
the memory hierarchy as an XADD without much trouble.

In the context of a compiler emitting a specialized atomic
instruction or a (idiomatically similar) LL/SC sequence, this is
not an issue. If the compiler can emit "AtADD R10 ← [R9], #1",
it could emit "LL R10 ← [R9]; ADD R10 ← R10, #1; SC [R9] ← R10;"
and hardware could convert that to behave as an "AtADD".
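A toy sketch of the kind of "local" idiom check being discussed, written in Python purely as a model. The `Insn` record, the `recognize_atomic_add` name, and the generic LL/ADD/SC mnemonics are assumptions made for illustration; no real decoder is claimed to work this way. The check compares the address expressions textually, so the MOV R6/MOV R8 variant above is deliberately not recognized.

```python
from typing import NamedTuple, Optional

class Insn(NamedTuple):
    op: str            # mnemonic, e.g. "LL", "ADD", "SC"
    dst: str           # destination register ("" for SC)
    srcs: tuple        # source operands (registers or "#imm")
    addr: str = ""     # textual address expression for LL/SC

def recognize_atomic_add(window: list) -> Optional[tuple]:
    """Return (addr, increment) if window is exactly LL; ADD; SC on one address pattern."""
    if len(window) != 3:
        return None
    ll, add, sc = window
    if (ll.op, add.op, sc.op) != ("LL", "ADD", "SC"):
        return None
    # Addresses must match textually; no attempt is made to prove that two
    # different registers happen to hold the same value.
    if ll.addr != sc.addr:
        return None
    # The ADD must consume the loaded value plus an immediate.
    if len(add.srcs) != 2 or add.srcs[0] != ll.dst or not add.srcs[1].startswith("#"):
        return None
    # The SC must store exactly the ADD result.
    if sc.srcs != (add.dst,):
        return None
    return (ll.addr, int(add.srcs[1][1:]))

local = [
    Insn("LL", "R9", (), "[IP,#48]"),
    Insn("ADD", "R10", ("R9", "#1")),
    Insn("SC", "", ("R10",), "[IP,#48]"),
]
non_local = [
    Insn("LL", "R9", (), "[IP,R6]"),
    Insn("ADD", "R10", ("R9", "#1")),
    Insn("SC", "", ("R10",), "[IP,R8]"),
]
print(recognize_atomic_add(local))      # ('[IP,#48]', 1)
print(recognize_atomic_add(non_local))  # None
```

A sequence that fails the check is simply executed as an ordinary LL/SC pair, which matches the "missed optimization opportunity" framing rather than a correctness problem.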

For conventional LL/SC, the example would also fail if R6 ≠ R8,
and I do not see a reason the compiler would generate such code.
(Maybe using the "free" check might be justified in some unusual
case?)

For a more flexible LL/SC interface — even one that merely
allowed the SC to target any location in the same cache block
reservation as the LL — such code might be reasonable and not
trivially recognized as performing a simple operation on a
single address (i.e., exportable to simple core-external
hardware). That would just be a missed optimization opportunity.

Please elaborate. There are few restrictions on the instructions
that lie between the LL and SC instructions - I don't see how
any CPU could translate an arbitrary sequence of instructions
between the LL and SC into an atomic bus operation efficiently.

The hardware implementation can choose which LL/SC-guarded
operations to export,

No, no, no; HW exports LL and SC and SW uses them as it sees fit.
The only choice HW has is whether LL and SC are user instructions.

I do not understand. Hardware can choose not to export "LL R10 ←
[R9]; SINE R10 ← R10; SC [R9] ← R10;" and perform such as a
"normal" LL/SC operation. (In-cache hardware is unlikely to
support transcendental FP operations and nor is PCI likely to
define support for such soon.)

which to optimize into a fast path within
the processor, and which to treat conventionally. Even in a more
conventional implementation, NAKs or deferred responses might be
used to promote forward progress.

This does require software developers to monitor what
optimizations are implemented, at least if there are
alternatives with possibly more desired performance
characteristics.

Unworkable.

It would be unworkable if such monitoring was necessary for most
software. For peak performance, microarchitectural knowledge is
sometimes needed (cache characteristics, operation latency,
etc.). Making such specialization rarely measurably beneficial
and very rarely substantially beneficial seems important.

I seem to recall you stating that a 2% local performance penalty
was acceptable.

For LL/SC fast pathing, there may not be many cases where the
semantics can be expressed in a diverse enough manner that
faster expressions would be possible.

[snip]

Even with atomic instructions, I get the impression that the
explicit implementation (performance/scaling) is not
architecturally defined.

Scaling is not an Architectural property; it is an implementation
property.

Yet scaling factors could determine which algorithm is higher
performance. E.g., if an atomic increment is not coalesced into
a tree (i.e., mediocre scaling), an algorithm that uses fewer
such operations but has other overheads might be chosen if/when
better scaling is desired.

This might be assigned to a more specific guarantee than the
Architecture (which is classically defined as
timing-independent), but that contract might be more general
than an implementation, whether a "family" of similar
implementations or a "profile" of non-Architectural behavior.

An atomic instruction might be
implemented with LL/SC with a guarantee of eventual success
(which would hopefully not be as bad as some x86 global lock for
cache block crossing LOCKed instructions).

You might be surprised at how glacial that eventual success is.

Chips and Cheese explored this recently. It is ugly.

(AArch64's STADD does not guarantee that the addition will be
done in the cache hierarchy even on a cache miss. The
architecture merely guarantees that the operation will be
atomic. An implementation could optimistically use an LL/SC-
based mechanism and fall back to locking rather than just
monitoring the reservation to ensure forward progress. With
out-of-order execution, the actual store to shared memory has
to be delayed until it is no longer speculative anyway;
replaying an atomic operation can be faster than a branch
misprediction — and even a branch misprediction can be fast
compared to communication between caches.)
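A minimal single-threaded model of the "optimistic first, then fall back" idea in the parenthetical above, written in Python. Everything here is hypothetical: the `Cell` class, the `atomic_add` function, the version counter standing in for reservation loss, and the retry bound of three are assumptions for illustration and say nothing about how any AArch64 implementation actually executes STADD.

```python
class Cell:
    def __init__(self, value=0):
        self.value = value
        self.version = 0   # bumped on every committed store; models reservation loss

def atomic_add(cell, delta, interfere=None, max_optimistic=3):
    """Try an LL/SC-style update a few times, then take an exclusive fallback path."""
    for attempt in range(max_optimistic):
        seen_version, old = cell.version, cell.value   # "LL"
        if interfere is not None:
            interfere(cell, attempt)                   # another agent may store here
        if cell.version == seen_version:               # "SC" succeeds only if undisturbed
            cell.value = old + delta
            cell.version += 1
            return ("optimistic", attempt + 1)
    # Fallback: nothing can interleave in this single-threaded model, which stands
    # in for hardware holding the line (or a lock) for the duration of the update.
    cell.value += delta
    cell.version += 1
    return ("fallback", max_optimistic)

def noisy(cell, attempt):
    """A hypothetical interfering agent that stores on every attempt."""
    cell.value += 100
    cell.version += 1

quiet_cell = Cell(10)
print(atomic_add(quiet_cell, 1), quiet_cell.value)        # ('optimistic', 1) 11
busy_cell = Cell(0)
print(atomic_add(busy_cell, 1, noisy), busy_cell.value)   # ('fallback', 3) 301
```

The addition is applied exactly once in both cases; only the path taken differs, which is the sense in which the architecture can guarantee atomicity without fixing the mechanism.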

IMO, LL/SC is an obsolete artifact of the past.

You, I, and Chris seem to agree on this detail.

Not really. You view LL/SC as too limited a form of optimistic
concurrency and not worth providing the implementation option of
smaller reservations or fewer features than ESM provides. To me,
My 66000's LOCKED memory instructions are basically the same as
LL/SC "merely" extended to support six cache lines within an
atomic scope and providing some other nice performance and
usability enhancements. (My 66000 is not targeted at the market
for 16-bit microcontrollers. The extra hardware for ESM is
small, especially in the context of how useful it can be.)

Scott Lurndal and Chris M. Thomasson at minimum see a place for
single-instruction atomics (and seemingly not primarily to
improve code density or decode complexity), which I believe were
strongly rejected for My 66000 because of the need to add more
instructions as capabilities expanded (like with SIMD).

Eliminating optimistic atomic operations provided by an
LL/SC-like mechanism ("LL/SC is an obsolete artifact of the
past.") is actually contrary to My 66000's design philosophy.

(Maybe this is just my weird conception of transactional memory
as a general interface that can have its scope constrained to a
single "word" granule and still be considered transactional memory.)

There are certainly advantages to presenting a fully developed
interface that supports a broad range of uses rather than
incrementally extending an interface. It may well be wiser to
provide something like ESM from the start rather than starting
with classic LL/SC or even cache-block granular LL/SC (with
multiple loads and stores and the SC able to use a different
address than the LL) with published plans for extending the
interface.

I think ESM could be significantly extended (without adding
instructions). Any page-aligned copy could be contained in an
atomic operation by using a cache block monitor as a page
monitor (presumably with a bitmask to indicate which blocks have
been copied) — probably too specialized a use case to be worth
the development and testing costs but possible. Increasing the
number of cache blocks monitored would not require any new
instructions. Supporting a read set constrained by L1 cache
capacity or a conservative filter might not require any new
instructions (though you have stated experience with initial ESM
is needed to judge what the next step should be).

I do wonder if something more like lock elision could be useful
for increasing concurrency by reducing the number of names used
to track conflict (lock name versus cache block address).

I think there is potential with something like versioned memory
to support more concurrency. A "stale" value would still be
valid if the entire use of that value can be viewed as occurring
earlier. (In theory, an ESM operation need not be aborted if a
single read set cache block is written by another operation. The
practical problem seems to be that tracking the dependencies for
even a moderate number of atomic operations is complex. The
benefit for interleaving atomic operations may very well not be
worth so much complexity!)

I disagree. I _feel_ LL/SC is a nice abstract interface that
not only allows high-performance implementations of simple
atomics without requiring new software but can also (in theory)
be extended to multiple reservations (like My 66000's ESM) and
even to very general transactional memory. (I think a better
interface is possible with easier decode, better code density,
and the opportunity for hints and/or directives, but such would
introduce other costs.)

I see specific atomic operations as somewhat attractive (idiom
recognition is nice but it is not free), but potentially
susceptible to an excessive expansion of instructions. (SIMD
has similar tradeoffs. I like SIMD, but it has issues.)

--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:33:33 2026

From Newsgroup: comp.arch

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead). Normal loads and stores within the code body would
be "guarded" and the SC could have a different address than the
LL. I.e., forward compatibility would be possible without adding
any Architectural state or new instructions while providing new
functionality.
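A rough Python model of a cache-block-granular reservation in which the SC may target a different address than the LL, as long as both fall in the reserved block. The `Memory` class, its method names, and the 64-byte `BLOCK_BYTES` are assumptions chosen for the sketch; real granule sizes are implementation-defined.

```python
BLOCK_BYTES = 64  # assumed cache block size

class Memory:
    def __init__(self):
        self.words = {}      # address -> value
        self.reserved = {}   # agent id -> reserved block number

    @staticmethod
    def block(addr):
        return addr // BLOCK_BYTES

    def ll(self, agent, addr):
        """Load and reserve the whole block containing addr."""
        self.reserved[agent] = self.block(addr)
        return self.words.get(addr, 0)

    def store(self, agent, addr, value):
        """Ordinary store: clears every other agent's reservation on that block."""
        self.words[addr] = value
        blk = self.block(addr)
        for other in [a for a, b in self.reserved.items() if b == blk and a != agent]:
            del self.reserved[other]

    def sc(self, agent, addr, value):
        """Succeeds if the agent still holds a reservation on the block containing addr."""
        if self.reserved.get(agent) != self.block(addr):
            self.reserved.pop(agent, None)
            return False
        self.store(agent, addr, value)
        del self.reserved[agent]
        return True

m = Memory()
m.words[0] = 5
v = m.ll("A", 0)
print(m.sc("A", 8, v + 1))   # True: address 8 is in the same block as address 0

m.ll("A", 0)
m.store("B", 16, 9)          # another agent writes elsewhere in the same block
print(m.sc("A", 0, 1))       # False: the reservation was lost
```

A store by another agent to a different block (address 64 and up in this model) would leave the reservation intact.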

From Paul Clayton@[email protected] to comp.arch on Wed May 20 19:47:57 2026

From Newsgroup: comp.arch

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the bus
lock and still make forward progress... Sigh... A horrible LL/SC
thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)

Of course, the temptation toward "good enough" (not so bad that
one will lose too many customers) is a problem. I would expect
documented guarantees of sufficient generality to have the
cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.

From Paul Clayton@[email protected] to comp.arch on Wed May 20 20:04:32 2026

From Newsgroup: comp.arch

On 5/14/26 11:03 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/11/26 10:38 AM, Scott Lurndal wrote:

[snip]

IME, atomic operations at the instruction set level have
not been implemented with LL/SC, even on architectures
that have LL/SC (or LDEX/STREX). The typical atomic
operations are designed so that they can generate
atomic PCIe (or other on-chip) transactions which cannot be
simulated using LL/SC.

I seem to recall reading Andy Glew mentioning that an x86
implementation was using such an internal mechanism — and he
expressed concerns about how it would ensure the Architectural
guarantees.

As I wrote before, any simple LL/SC operation that could be
replaced by the compiler with a simple atomic instruction could
be recognized by hardware as a special case for optimization and
made to behave as if it was a single atomic instruction.

[snip]

We'll have to agree to disagree. I consider the lack of scalability
of LL/SC to be a fatal defect.

I believe the lack of scalability is an implementation choice
and allowing that poor scalability is an Architectural choice.
I.e., this is not about the instruction interface so much as
about quality of implementation (and Architectural or "profile"
guarantees).

Maybe practically one cannot trust processor developers (and
those defining the guarantees) to do the extra work to close
that gap. Maybe advertising atomic instructions is more
effective than advertising well-implemented LL/SC. (I am
sufficiently discouraged about human nature and current human
society to believe that "well-implemented LL/SC" is a
cloud-cuckoo-land concept.)

I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.

From scott@[email protected] (Scott Lurndal) to comp.arch on Thu May 21 20:17:14 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).

It would be limiting to tie LL/SC to cache lines.

Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.


From scott@[email protected] (Scott Lurndal) to comp.arch on Thu May 21 20:22:46 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> writes:

On 5/14/26 11:03 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/11/26 10:38 AM, Scott Lurndal wrote:

[snip]

IME, atomic operations at the instruction set level have
not been implemented with LL/SC, even on architectures
that have LL/SC (or LDEX/STREX). The typical atomic
operations are designed so that they can generate
atomic PCIe (or other on-chip) transactions which cannot be
simulated using LL/SC.

I seem to recall reading Andy Glew mentioning that an x86
implementation was using such an internal mechanism — and he
expressed concerns about how it would ensure the Architectural
guarantees.

As I wrote before, any simple LL/SC operation that could be
replaced by the compiler with a simple atomic instruction could
be recognized by hardware as a special case for optimization and
made to behave as if it was a single atomic instruction.

[snip]

We'll have to agree to disagree. I consider the lack of scalability
of LL/SC to be a fatal defect.

I believe the lack of scalability is an implementation choice
and allowing that poor scalability is an Architectural choice.
I.e., this is not about the instruction interface so much as
about quality of implementation (and Architectural or "profile"
guarantees).

Maybe practically one cannot trust processor developers (and
those defining the guarantees) to do the extra work to close
that gap. Maybe advertising atomic instructions is more
effective than advertising well-implemented LL/SC. (I am
sufficiently discouraged about human nature and current human
society to believe that "well-implemented LL/SC" is a
cloud-cuckoo-land concept.)

I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.

Functionality guarantees, yes. Performance has to suffer,
unless the hardware can analyze all the instructions between
the LL/SC and abstract them into a single bus operation; which
I don't see as feasible.

If you can figure out how to implement LL/SC optimally
to CXL remote memory for the same set of atomic operations
provided by PCI express, I'd be interested in the result.

From MitchAlsup@[email protected] to comp.arch on Fri May 22 17:38:42 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> posted:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for

It took me an entire year (2000+ hours) to create ASF after knowing
how LL/SC works. The "here is the basic idea" was only a couple of
days--the rest of the time was making "here are a small number of
cache lines", "make them all available at the same time", in such
a way that "you can make all updates appear system wide in a single
instance" or "make them appear to have never been modified" with
semantics that work EVEN IF YOU DO NOT HAVE A CACHE in the CPU.

Then there is multiple-LL memory order semantics,
detection of interference,
a system arbiter when interference is heavy,
and what to do when interference prevents completion.

LL/SC is easy, compared to making multiple-LL and multiple-SC
work.

small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead). Normal loads and stores within the code body would
be "guarded" and the SC could have a different address than the
LL. I.e., forward compatibility would be possible without adding
any Architectural state or new instructions while providing new
functionality.


From MitchAlsup@[email protected] to comp.arch on Fri May 22 17:42:50 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> posted:

On 5/14/26 11:03 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/11/26 10:38 AM, Scott Lurndal wrote:

[snip]

IME, atomic operations at the instruction set level have
not been implemented with LL/SC, even on architectures
that have LL/SC (or LDEX/STREX). The typical atomic
operations are designed so that they can generate
atomic PCIe (or other on-chip) transactions which cannot be
simulated using LL/SC.

I seem to recall reading Andy Glew mentioning that an x86
implementation was using such an internal mechanism — and he
expressed concerns about how it would ensure the Architectural
guarantees.

As I wrote before, any simple LL/SC operation that could be
replaced by the compiler with a simple atomic instruction could
be recognized by hardware as a special case for optimization and
made to behave as if it was a single atomic instruction.

[snip]

We'll have to agree to disagree. I consider the lack of scalability
of LL/SC to be a fatal defect.

I believe the lack of scalability is an implementation choice
and allowing that poor scalability is an Architectural choice.
I.e., this is not about the instruction interface so much as
about quality of implementation (and Architectural or "profile"
guarantees).

Maybe practically one cannot trust processor developers (and
those defining the guarantees) to do the extra work to close
that gap. Maybe advertising atomic instructions is more
effective than advertising well-implemented LL/SC. (I am
sufficiently discouraged about human nature and current human
society to believe that "well-implemented LL/SC" is a
cloud-cuckoo-land concept.)

I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.

You cannot make an LL/SC architecture that can do both Test-and-set
and Compare-and-swap with commonly held semantics of T&S and CAS.
One requires monitoring the LL address for interference from the
LL to the SC, the other requires not knowing about interference
and only checking of data-equivalence at SC.
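The distinction drawn above between monitoring for interference and checking data equivalence can be illustrated with a small Python model. The `Word` class and the `cas`, `ll`, and `sc` functions are hypothetical names for this sketch, and the write counter is only a stand-in for a hardware reservation monitor. The sketch shows the two success conditions diverging on an A, B, A history; it does not settle whether one architecture could offer both.

```python
class Word:
    def __init__(self, value):
        self.value = value
        self.writes = 0        # count of stores; models the reservation monitor

    def store(self, value):
        self.value = value
        self.writes += 1

def cas(word, expected, new):
    """Compare-and-swap: succeeds whenever the current data equals expected."""
    if word.value != expected:
        return False
    word.store(new)
    return True

def ll(word):
    """Load-locked: returns the value plus a token standing in for the reservation."""
    return word.value, word.writes

def sc(word, token, new):
    """Store-conditional: fails if any store hit the word since the matching ll()."""
    if word.writes != token:
        return False
    word.store(new)
    return True

# CAS path: the value goes A -> B -> A between the read and the CAS.
w1 = Word("A")
old = w1.value
w1.store("B")
w1.store("A")
print(cas(w1, old, "C"))      # True: the data matches, interference is invisible

# LL/SC path: same history between the LL and the SC.
w2 = Word("A")
old, token = ll(w2)
w2.store("B")
w2.store("A")
print(sc(w2, token, "C"))     # False: the intervening stores were observed
```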

From Paul Clayton@[email protected] to comp.arch on Sun May 24 17:24:47 2026

From Newsgroup: comp.arch

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.
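As a back-of-the-envelope illustration of that trade-off, the Python sketch below counts how many unrelated writes would knock out a reservation as the granule grows. The `spurious_failures` function, the addresses, and the granule sizes are all made up for the example.

```python
def spurious_failures(reserved_addr, other_writes, granule_bytes):
    """Count writes to other addresses that still land in the reserved granule."""
    granule = reserved_addr // granule_bytes
    return sum(
        1 for addr in other_writes
        if addr != reserved_addr and addr // granule_bytes == granule
    )

writes = [8, 72, 200, 4096, 70000]   # hypothetical writes by other agents
for size in (8, 64, 256, 1 << 32):
    print(size, spurious_failures(0, writes, size))
# 8 -> 0, 64 -> 1, 256 -> 3, 4294967296 -> 5
```

With a granule covering the whole address space every foreign write is a conflict, which is the "global lock" end of the spectrum.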

The less specifically the size is defined, the less performance-
portable software becomes. One can address this with something
like RISC-V profiles, in which sizes can be more specific and
software that cares will specify a target profile rather than an
Architecture (version).

Since granule size can influence what code is most efficient,
even recompiling is not an excellent option. So for a class of
applications, having a single target seems to make sense.

Being able to test software on a development machine can also be
useful, so desired performance compatibility might be broader
than an application type.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).

It would be limiting to tie LL/SC to cache lines.

It is not tying the operation to cache lines but to cache
line granules in terms of external interference monitoring
(and, in the case of a modest extension beyond traditional
LL/SC, the scope of the read/write set).

Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.

I am not certain that LL/SC (or an extended form of such)
could not be used with "I/O" addresses. This merely requires
the equivalent of one cache line "cache" (or the largest
guaranteed size of a transaction) and some form of
monitoring ("coherence") of such memory addresses.

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action
occur in program order? The former seems more orthogonal — all
agents are treated the same — but the latter seems more
consistent with the idea that actions should occur as if no other
threads were running. If the I/O agent is considered just
another agent, then I/O addresses with side effects within the
granule might reasonably be considered interference causing the
transaction to always fail.

I do not know how an atomic operation instruction would handle
a perverse case. If such instructions generate an exception, a
LL/SC sequence could do the same or produce an "always fail"
transaction failure indicator (probably with additional
metadata to indicate the nature of the failure).

I do not know what the monitoring implications for supporting
I/O atomics would be. For simple operations, translation to the
equivalent of an atomic instruction seems reasonable. However,
if more extensive operations are permitted, then considerable
care seems necessary to define semantics that are
comprehensible, testable, and cost-effective.

My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases. [I am being lazy and not waiting
to look up this information and edit this before posting.])

From Paul Clayton@[email protected] to comp.arch on Sun May 24 21:32:47 2026

From Newsgroup: comp.arch

On 5/22/26 1:38 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for

It took me an entire year (2000+ hours) to create ASF after knowing
how LL/SC works. The "here is the basic idea" was only a couple of
days--

This is one of the benefits of others' recording their
experiments. The "basic idea" becomes a trivial extension of
prior work or at least "obvious to one skilled in the art".

(Because I only think about hardware at a fairly high level of
abstraction, I can hand wave a lot of issues. I am also almost
completely ignorant of "system-level" issues and other complex
interactions. I do not mean to belittle the effort behind the
original ASF — though from some of your statements dealing with
"business issues" was a significant part of the effort and kept
AMD's ASF from having some of the features you developed.)

By the way, is there a reason that ESM did not include the
same operation provided by RELEASE in AMD's ASF? Is removing
entries from the transaction not worthwhile for the sort of
smallish transactions targeted by ESM? (It is also possible that
I missed the presence of such in ESM or that my version of
Principles of Operations (28 January 2020) is so out-of-date
that it is not accurate for ESM anymore.)

Discarding read set members seems tricky for software as it
would have to guarantee that no "overlapping" reads occurred.
Such is possible if multiple data structures do not share a
cache block (or more complexly if any possible cache block
sharers are never involved together in a transaction that
discards possible sharers).

The ASF justification for RELEASE — "RELEASE can be used to
circumvent ASF's capacity limitations when traversing
potentially long chains of pointers." — is a limited use case
*and* being a "hint" it did not increase the guaranteed capacity
(four 64-byte memory regions) so a transaction would still
require fallback code.

ASF also supported "unprotected" (transaction escaping) memory
accesses (those that do not use the LOCK prefix), which I think
ESM does not provide. Such could be useful for thread-local and
"ROM" accesses (to avoid capacity issues) and for "shared"
unconditional accesses (which might be guarded by a rarely
contested lock, have software handling for inconsistent state,
or be hint-like such as software performance counters). I guess
such could also be useful for any data that is "unreachable"
until the transaction commits such as a new memory allocation,
but this seems the same as thread-local memory or a lock-guarded
memory set.

the rest of the time was making "here are a small number of
cache lines", "make them all available at the same time", in such
a way that "you can make all updates appear system wide in a single
instance" or "make them appear to have never been modified" with
semantics that work EVEN IF YOU DO NOT HAVE A CACHE in the CPU.

The practical implementation aspects naturally take longer.

I would have guessed that the NAK trick and the interference
counter trick did not come to mind in the first moment. The NAK
trick may have been more obvious in concept ("almost done, just
give me a sec") but working it out so that it would not cause
performance issues or even (practically) lock-ups is harder.
(The replay problem for an out-of-order scheduler seems simple
enough in concept but is a hard problem in an actual high-
performance design.)

The usability tuning (and extension directions) tends to require
actual hardware (simulation may be too expensive and limited
mostly to internal exploration — having users attempt crazy
things can be helpful). Some extension possibilities are
obvious (and some obviously practical at least in terms of
hardware cost), but even some of the tuning of the performance
of the existing interface might require years of experience with
use of the hardware by a reasonably broad range of users.

(I do not think not having a proper cache is so much of a
problem. Yes, the coherence interface would have to be added to
detect interference and enough buffering to supply the required
capacity. On the other hand, I would be tempted to use that
storage for other things, maybe prefetch buffers. For non-
scalable systems, it might be practical to share buffers among
multiple agents, e.g., allowing 16 cache blocks among 8 cores
with no cache, which could also move coherence to that central
storage, but that seems to imply writing directly to such
"distant" storage.)

I think another factor is having the atomic operation be
inexpensive both in the case of no interference and in the retry
case. I got the impression that Intel's TSX (and lock elision)
were expensive both in set-up and especially in retry; part of
this may be from competing with slow LOCK-based atomic
instructions.

In theory, a failure should not be much more expensive than a
branch misprediction. (An "advanced" implementation could
provide faster retries for small transactions by keeping the
decoded instructions in schedulers and requesting updates from
writes by external agents [i.e., as soon as the external write
committed or the invalidation response was received if the write
committed before that, an implicit read request would be acted
upon].) Even with L1 cache capacity transactions, clearing all
the transaction bits can be fast and if most of the transaction
was read set a retry should run fairly fast.

(Write set cache blocks would normally be written back to L2 to
provide a checkpoint, so retrying would involve an L1 miss for
all write set blocks. This could be avoided for small write sets
that fit in load-store queue entries or some other buffer.
Keeping a list of written cache blocks would also allow
prefetching (and retention in L2 so checkpointing would not
require writebacks).)

Then there is multiple-LL memory order semantics,
detection of interference,
a system arbiter when interference is heavy,
and what to do when interference prevents completion.

LL/SC is easy, compared to making multiple-LL and multiple-SC
work.

Which is one reason that I thought a cache line granular LL/SC
might be a reasonable next step beyond traditional LL/SC.

Single address atomic operations (with larger granule) are
also attractive in facilitating commit order optimization.
(Many single "word" atomic operations facilitate more
optimizations like remote execution and coalescing for
operations like atomic add.)

Of course, once someone has worked out how to do multiple
cache block reservations well, limiting an implementation to
one cache block might not be reasonable.

One issue with a LL/SC-oriented interface is that a fully
external operation is harder to express. Zeroing all the
touched registers immediately after the operation technically
would communicate that the loaded values and their descendants
were not preserved outside the transaction, but that would be
a messy idiom to detect. With a read-as-zero-or-last-write
register, single value "push-only" operations might be easier
to detect; though detecting this for LL R2 ← [R1];
OP R2 ← R3, R4; SC [R1] ← R2; might not be that much more
difficult than for LL R0 ← [R1]; OP R0 ← R3, R4; SC [R1] ← R0;
(special casing one register that is already special cased
may be a little easier).

Side comment: ESM and extended LL/SC mechanisms seem to have a
problem with expressing "exportable" operations even at the
close of a transaction. Operations that could be performed
remotely or even coalesced could avoid "false" interference if
hardware knew that the operation did not use the loaded value
except to compute a new value and did not use the new value
within the transaction except to store it to the original
address (though copy and perhaps swap operations might also be
practical targets for exporting).

E.g., a bank transfer might only need to read the current
balance from the provider account (to ensure there are
sufficient funds) and subtract the amount transferred and then
use an exportable atomic add for the other account. (Software
might guarantee no overflow fairly easily with 64-bit integers
(and no transfers larger than the U.S. debt ☹) or hardware might
provide an exception and software could correct the problem.)
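A rough sequential model of that split may make the dependency structure clearer. This is only a sketch under assumptions: the function name, the dictionary of balances, and the "exported" list are all hypothetical, and nothing here models real conflict detection. The point is just that the debit needs the loaded value while the credit is a blind add that only has to be conditional on commit.

```python
def transfer(balances, src, dst, amount):
    """Model a transfer whose credit is an exported, commit-conditional add."""
    # Transactional part: the source balance is read, checked and debited,
    # so the loaded value is genuinely needed inside the transaction.
    if balances[src] < amount:
        return False  # abort: nothing is exported
    new_src = balances[src] - amount

    # Exportable part: the destination balance is never read here, so only
    # the request "add amount to dst" would need to be shipped out.
    exported = [(dst, amount)]

    # Commit: publish the debit, then apply the exported adds.
    balances[src] = new_src
    for account, delta in exported:
        balances[account] += delta
    return True
```

In a real implementation the commit step would be where the outer memory system confirms that the read set was not disturbed; here it is just straight-line code.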

The exported operation still needs to be conditional on the
transaction (otherwise it could just be a separate transaction,
though that might be very expensive in some implementations),
but it does not have the same kind of data dependency that other
atomic operations have.

In theory, multiple operations could be exported if they are all
dependent only on the other parts of the transaction committing.
However, ensuring that the ordering guarantees are enforced
seems likely to be very difficult with more than one simple
atomic operation.

I _think_ any single store (which could include a cache block
with a write mask) could *theoretically* be exported if the
block was not in the transaction's read set. This would still
delay the commitment of the transaction until after the outer
memory system confirmed that there were no conflicts (so the
latency would be similar to cache miss handling) because the
read set still needs to be guarded, but such might reduce
conflicts either by using finer-grained interference detection
or by facilitating optimization of ordering (using more
central arbitration).

I am rather skeptical that there are significant uses for such
"blind stores" much less enough to justify such complexity. Yet
if I am reasoning correctly, such could slightly improve
performance of that corner case.

Another side comment: in theory, a shared bump counter
allocation could use an exported atomic add and not need the
actual result until after the transaction if a temporary pseudo-
address is assigned as a placeholder value for the address that
will be generated by the atomic add (and if there was a
guarantee that the allocation would succeed or dropping the data
was acceptable on allocation failure — providing a "red zone"
of memory for such "failed" allocations would be another
option and when the red zone reaches a watermark all such
allocation optimizations are not performed).

(Memory allocation in general is separable in this manner. The
actual address returned by the allocation is not needed until
the allocation is visible outside of the thread; a placeholder
address can provide coherence within a thread. In theory, a
placeholder address could even be used between threads if
part of the virtual address space is reserved for such uses, but
replacing such uses seems like doing garbage collection for a
C program. With capabilities or marked pointers, such data
would be distinguishable as pointers and so might be garbage
collected at some cost. [Hmm. Would there be any value in a
page-granular protection that only prohibited writing to
(marked) pointers?])

--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Sun May 24 23:35:20 2026

From Newsgroup: comp.arch

On 5/21/26 4:22 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

[snip]

I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.

Functionality guarantees, yes. Performance has to suffer,
unless the hardware can analyze all the instructions between
the LL/SC and abstract them into a single bus operation; which
I don't see as feasible.

If you can figure out how to implement LL/SC optimally
to CXL remote memory for the same set of atomic operations
provided by PCI express, I'd be interested in the result.

I am not a hardware designer, but recognizing LL Rx ← [Ry];
OP Rx ← Ra, Rb; SC [Ry] ← Rx and converting it to the
appropriate PCIe atomic (when "OP" is a PCIe supported atomic
operation) does not seem that difficult. Yes, three instruction
idioms are more complex than two instruction idioms, but the
first part of detection (destination of first instruction is the
same as the source for the following instruction) is a common
idiom detection factor and necessary even for in-order
superscalar execution.

CMP+Jn fusion in x86 is a little simpler since Jn will always be
dependent on an immediately preceding CMP (there is only one
flags register), but it still requires comparing two opcodes.

For the proposed limited LL/SC fusion, I think the following
logic suffices:

if I[0].opcode == LL
and
if I[0].Rdst == (I[1].Rsrc1 or I[1].Rsrc2)
and
if I[0].Rdst == I[2].Rsrc1
and
if I[2].opcode == SC
and
if I[1].opcode == EXPORTABLE
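As a software model of those five comparisons, here is a minimal sketch. It reads the second comparison as covering both source operands of the middle instruction, and it mirrors the conditions as listed without adding any extra check (for example on the address register). The Insn record, the field names, and the contents of EXPORTABLE_OPS are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of operations that could be shipped out as one atomic
# request; the post only refers to this class as "EXPORTABLE".
EXPORTABLE_OPS = {"ADD", "AND", "OR", "XOR"}


@dataclass(frozen=True)
class Insn:
    opcode: str
    rdst: Optional[int] = None   # destination register (unused for SC)
    rsrc1: Optional[int] = None  # first source (for SC: the data register)
    rsrc2: Optional[int] = None  # second source (for LL/SC: the address register)


def is_fusible_ll_op_sc(window):
    """Return True if a three-instruction window matches LL; OP; SC."""
    if len(window) != 3:
        return False
    ll, op, sc = window
    return (
        ll.opcode == "LL"
        and ll.rdst in (op.rsrc1, op.rsrc2)  # OP consumes the loaded value
        and ll.rdst == sc.rsrc1              # SC stores the register LL wrote
        and sc.opcode == "SC"
        and op.opcode in EXPORTABLE_OPS
    )
```

For example, LL R10 ← [R9]; ADD R10 ← R10, R3; SC [R9] ← R10 matches, while the same sequence with an operation outside the exportable set does not.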

It would probably be acceptable to assume that if the third
instruction is a SC, it will also meet the pattern (so a very
unusual misspeculation — conditionally storing a value unrelated
to the linked load — could be handled slowly) and check that the
register is the same later (though that register check might not
add latency since dependencies need to be checked anyway). It
may also be acceptable to delay the operation check as the LL
address generation is required regardless of how the
operation is actually handled.

An alternative would be to fuse every three instruction LL/SC
sequence and crack the fused instruction later if the operation
is not one supported in a fused format. (This cracking could be
independent of whether the operation can be exported. An
implementation might have a scheduler that fires an operation
multiple times in different "modes" such that it could execute
this fused operation internally.)

For three-instruction LL/SC sequences, there is also very little
reason for the intermediate instruction not to use the LL result
as an input value. So one could probably speculate that an
operand is the same and replay from fetch on misspeculation.

This would have the fusion be dependent only on one comparison
of about two sets of about six bits (admittedly, separated by
the length of two instructions). For a narrow decode
implementation this seems inappropriate.

Such detection would require some extra buffering (even a
wide decode implementation would have to handle crossing decode
chunks), but such seems a modest overhead.

Delaying a potentially exportable atomic operation by a cycle or
two would also seem not to be very problematic. Even in an in-
order implementation, the atomic operation cannot be exported
until after all previous operations are guaranteed not to
produce exceptions and the operands are available.

I think I would prefer a specialized form of LL that produced an
implicit SC after one (or N) instructions both to assist such
idiom recognition and to provide code density (no SC instruction
and no success test instruction; in that way similar to IBM's
constrained transactions, which are guaranteed to complete). Even
with an N-instruction body LL, there would only be a comparison
of one opcode to a constant and the count field to one to detect
the single-instruction case with an additional opcode comparison
to determine if it is exportable. This does introduce one
"extra" opcode, but it avoids adding an opcode for every
possible exportable atomic operation and facilitates old
software using new hardware features. The code density benefit
would be greatest for single instruction bodies (100% extra
overhead relative to a specialized atomic instruction compared
to 300% overhead — SC, branch-on-interference — of traditional
LL/SC), but I am not certain if limiting the opcode to that is
best.
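The 100% and 300% figures are just static instruction counts relative to a one-instruction specialized atomic. A tiny sketch of the arithmetic, with made-up mnemonics standing in for the three encodings:

```python
def overhead_percent(n_instructions, baseline=1):
    """Extra static instructions relative to a baseline sequence, in percent."""
    return 100 * (n_instructions - baseline) // baseline


# Hypothetical mnemonics; only the sequence lengths matter here.
specialized = ["ATOMIC_ADD"]
implicit_sc = ["LL_IMPLICIT_SC", "ADD"]
traditional = ["LL", "ADD", "SC", "BRANCH_ON_FAIL"]
```

With these lengths, the implicit-SC form costs one extra instruction (100%) and the traditional form three extra (300%).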

I also am inclined to provide an interface that allows avoiding
an explicit interference test and branch. For simple
transactions that hardware can reasonably guarantee will
complete, automatic retry is practical. Having a special
exception for "always fail" or "recommend no retry" would add
overhead associated with a single handler managing multiple
failure points (admittedly on a generally less critical path)
with the only benefit being slightly shorter dynamic code by
removing a branch instruction. Since the cases where completion
could not be guaranteed would tend to be long, the cost of a
branch instruction may not be significant.

(I just noticed that My 66000 has "predicate on the condition of
interference", which may allow escaping memory accesses.)

From Chris M. Thomasson@[email protected] to comp.arch on Tue May 26 12:44:20 2026

From Newsgroup: comp.arch

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we then need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


From MitchAlsup@[email protected] to comp.arch on Tue May 26 20:58:52 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we then need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

Does this "LL/SC and other core instructions synchronization means" not
fall from "desirable" when one has a complete set of to-memory() atomic
actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
quadratic and cubic interconnect traffic in the system which are the
real point of slow synchronization ??!!?? while being guaranteed to
work without an interference and can be done for both cacheable and
unCacheable memory accesses ??!!??

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


From Chris M. Thomasson@[email protected] to comp.arch on Tue May 26 14:00:36 2026

From Newsgroup: comp.arch

On 5/26/2026 1:58 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 5/24/2026 2:24 PM, Paul Clayton wrote:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

With a large granule, we then need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?

Does this "LL/SC and other core instructions synchronization means" not
fall from "desirable" when one has a complete set of to-memory() atomic
actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
quadratic and cubic interconnect traffic in the system which are the
real point of slow synchronization ??!!?? while being guaranteed to
work without an interference and can be done for both cacheable and
unCacheable memory accesses ??!!??

Take a look at some S/HTM... A single load can cause a retry, and lead to
live lock?

FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ


From scott@[email protected] (Scott Lurndal) to comp.arch on Wed May 27 14:25:19 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> writes:

On 5/21/26 4:17 PM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]

Wrt LL/SC, how large is the reservation granule? PPC has some
insight...

Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.

ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).

Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.

The less specifically the size is defined, the less performance-
portable software becomes. One can address this with something
like RISC-V profiles, in which sizes can be more specific and
software that cares will specify a target profile rather than an
Architecture (version).

Since granule size can influence what code is most efficient,
even recompiling is not an excellent option. So for a class of
applications, having a single target seems to make sense.

Being able to test software on a development machine can also be
useful, so desired performance compatibility might be broader
than an application type.

I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).

It would be limiting to tie LL/SC to cache lines.

It is not tying the operation to cache lines but to cache
line granules in terms of external interference monitoring
(and, in the case of a modest extension beyond traditional
LL/SC, the scope of the read/write set).

Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.

I am not certain that LL/SC (or an extended form of such)
could not be used with "I/O" addresses. This merely requires
the equivalent of one cache line "cache" (or the largest
guaranteed size of a transaction) and some form of
monitoring ("coherence") of such memory addresses.

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action to
occur in program order?

Atomics ensure that the access is atomic with respect to
all other accessors - ensuring that the other accessors
will not see inconsistent data.

Atomics can be used as a basis (e.g. atomic test&set) to
guard a critical section, but they're also useful for
adjusting shared counters et alia.

My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)

See above.

From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:08:17 2026

From Newsgroup: comp.arch

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the bus lock
and still make forward progress... Sigh... A horrible LL/SC thing can
live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of
implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

A guarantee of forward progress is not very useful if the progress is
glacially (or cosmologically) slow. ("We guarantee that the operation
will complete before the heat death of the universe"☺)

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is hyper
important to help the software pad and align to remove any false sharing
on said granule. No? But...

Here's where the deeper problem can rear its ugly head... Vendors often
don't document it? Or they document it inconsistently across revisions? So
even if you do everything right in principle, you're tuning against a
number you had to dig out of a forum post or reverse engineer yourself.
Scary! ;^o

Of course, the temptation toward "good enough" (not so bad that one
will lose too many customers) is a problem. I would expect documented
guarantees of sufficient generality to have the cognitive load for
software developers be acceptable. That such guarantees seem to be very
rare is sad.

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.


From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:14:11 2026

From Newsgroup: comp.arch

On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
[...]

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.

A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
LL/SC cannot... ?


From Chris M. Thomasson@[email protected] to comp.arch on Wed May 27 14:24:36 2026

From Newsgroup: comp.arch

On 5/27/2026 2:14 PM, Chris M. Thomasson wrote:

On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
[...]

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very
few.

A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
LL/SC cannot... ?

For x86, it's "easier" for sure... pad _and_ align on an L2 cache line,
and you should be ideal... So do NOT straddle a cache line and execute a
damn LOCK RMW on it. Bus lock for sure.
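To make the pad-and-align advice concrete, here is a small sketch of two helpers: one checks whether an object would straddle a line, the other rounds an offset up to the next line boundary. The 64-byte line size and the helper names are assumptions for illustration; the real number has to come from the target's documentation.

```python
LINE = 64  # assumed cache line size in bytes; check the actual target


def straddles(offset, size, line=LINE):
    """True if the byte range [offset, offset + size) touches more than one line."""
    return offset // line != (offset + size - 1) // line


def align_up(offset, line=LINE):
    """Round offset up to the next multiple of line (line must be a power of two)."""
    return (offset + line - 1) & ~(line - 1)
```

An 8-byte word at offset 60 straddles two 64-byte lines; moving it to align_up(60), which is 64, keeps it inside one.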

From MitchAlsup@[email protected] to comp.arch on Thu May 28 01:27:36 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the bus lock
and still make forward progress... Sigh... A horrible LL/SC thing can
live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a quality of
implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

A guarantee of forward progress is not very useful if the progress is
glacially (or cosmologically) slow. ("We guarantee that the operation
will complete before the heat death of the universe"☺)

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is hyper
important to help the software pad and align to remove any false sharing
on said granule. No? But...

Here's where the deeper problem can rear its ugly head... Vendors often
don't document it? Or they document it inconsistently across revisions? So
even if you do everything right in principle, you're tuning against a
number you had to dig out of a forum post or reverse engineer yourself.
Scary! ;^o

Of course, the temptation toward "good enough" (not so bad that one
will lose too many customers) is a problem. I would expect documented
guarantees of sufficient generality to have the cognitive load for
software developers be acceptable. That such guarantees seem to be very
rare is sad.

How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.

Following a "SC failure" My 66000 provides a readable control register
called 'WHY' which contains a number. Negative numbers represent kinds
of failures {resource limit exceeded, time out, ...} while positive
values indicate how far back in-line your request is (measured by a
resource which has unique system-wide visibility to ATOMIC-order).

Thus, SW can use WHY to reach deeper into the Queue of pending work and
select a unit that nobody else is going to go after on the next iteration.
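A very rough software model of that usage, purely as a sketch: only the register name WHY and the sign convention come from the description above. The function name, the treatment of zero as "head of the line", and the choice to return None when the position runs past the available work are assumptions for illustration.

```python
def pick_work_index(why, pending):
    """Map a WHY-style value to an index into the pending-work list.

    Hypothetical policy: a negative value is a failure code, so no unit is
    chosen; a non-negative value is treated as the caller's position in
    line and used as an offset from the head of the list.
    """
    if why < 0 or why >= len(pending):
        return None
    return why
```

If every contender applies the same rule, each one retries on a different unit instead of all of them colliding on the head of the queue again.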


From Paul Clayton@[email protected] to comp.arch on Sun May 31 21:32:14 2026

From Newsgroup: comp.arch

On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations. IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 LOCK-prefixed operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have
the cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

Again, I think this is more about "quality of
implementation" (and Architectural guarantees about such) than
about the interface at an instruction level.

From Paul Clayton@[email protected] to comp.arch on Sun May 31 23:26:39 2026

From Newsgroup: comp.arch

On 5/27/26 10:25 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

[snip]

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

A more generic interface has some advantages.

I already mentioned that old software that was developed when
there was not an atomic ["expensive" operation] instruction
could benefit from idiom recognition on new hardware. (An
alternative to that would be patching or recompiling the
software. While I prefer a more abstract software distribution
format for its ability to avoid having to move things to
Architecture and even potentially perform microarchitectural
optimizations at non-instruction granularity, such seems
unlikely to be common any time soon.)

Even with atomic instructions, the Architecture generally does
not provide guarantees about scalability. I doubt any
implementation would stop-the-world to perform an atomic
operation (because the performance penalty would be quite
noticeable), but I can easily imagine an implementation
waiting until the atomic operation is not speculative before
starting it.

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized. (System calls
have similar excessive, in my opinion, latency. Some of this may
be from cruft, but I received the impression that limited
optimization effort is a significant cause for the higher latency.)

I do not like the code bloat and decode complexity of using
LL/SC for simple atomic operations. Unfortunately, even a
LL-and-SC-after-next-compute instruction (which would allow
arbitrary single compute instruction atomics and might be
extended by function call instructions to microcode) would have
the bloat of redundant register name encoding. Even a diversity
of addressing modes may be excessive for atomic operations, if
simple register-indirect with no offset is sufficiently common.

With destructive operations (like x86), it would be possible to
avoid the register name overhead by having the LL instruction
not include a register name, taking it from the following
compute instruction. For an LL instruction lacking a register
name, if "microcode" calls were to be supported such call
instructions would need to specify a register name (or use a
defined, possibly function-specific ABI). An opcode-only LL
might reasonably have space for hint/directive metadata, which
might be useful.

My objection to specific atomic instructions is mainly that
they are specific. If an operation later becomes a reasonable
target for such an instruction, a new instruction must be
allocated to provide that operation. That new instruction would
only be available to new software.

For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action to
occur in program order?

Atomics ensure that the access is atomic with respect to
all other accessors - ensuring that the other accessors
will not see inconsistent data.

I think I communicated poorly. I was thinking about what the
appropriate behavior of an atomic add operation (however
encoded) should be when targeting an address with side effects.
The simple choice is "don't do that" (undefined behavior). The
slightly more complex choice is fault on bad behavior.

Yet one might argue that targeting such an address for an atomic
operation could be useful in some particular context. Supporting
such means making a choice of how the side effect is handled.

(I am inclined to just having such fault, but that needs to be
defined as it means that acquiring a lock, performing a read,
operating on the read value, writing the result, and releasing
the lock is not functionally equivalent to an atomic operation.)

Is the read side effect ignored? For side effects limited to the
accessed address, this would seem to be the same as the side
effect happening "between" the read and the write. For side
effects with external effects, those would also be suppressed,
making such different than having the side effect occur
"between" the read and the write.

Is the side effect done "between" the read and the write of the
"atomic" operation? This would presumably overwrite the address-
local side effect while producing other side effects, which
might seem very strange as the side effect would use the old
value for any value-dependent side effects.

Is the side effect performed after the atomic operation? This
could also be confusing.

Even if the side effect does not change the value at the
address, the value before or after the atomic operation might be
used to determine what the side effect is.

Removing side effects places atomics in a special category,
which may be reasonable but is not a choice 100% obvious to
everyone. Consistently and sensibly ordering side effects with
atomics seems challenging.

Such side effects are like atomic operations, which leads to a
conflict. If the non-side effect operation is truly atomic, one
might break the definition of the side effect.

I would guess that each device would choose its supported
behavior, but that would seem to add unnecessary complexity.
Just faulting on such use seems sensible, but then one needs
to distinguish between addresses that fault and addresses that
allow atomic operations.

I just looked it up: Power (version 2.06B), as an example,
restricts Load Reserved to coherent memory: "The storage
location specified by the Load And Reserve and Store Conditional
instructions must be in storage that is Memory Coherence
Required if the location may be modified by another processor or
mechanism. If the specified location is in storage that is Write
Through Required or Caching Inhibited, the system data storage
error handler or the system alignment error handler is invoked
for the Server environment and may be invoked for the Embedded
environment." I therefore suspect that even if such was
extended to support PCI-E atomics, addresses with side effects
would fault.

Atomics can be used as a basis (e.g. atomic test&set) to
guard a critical section, but they're also useful for
adjusting shared counters et alia.

(There seem to be a lot of alia/other uses. Atomic OR seems like
a useful means of supporting multiple "named" read locks; if
implemented aggressively, atomic OR could even be used for
bit-sized locks in combination with atomic AND.)
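
A minimal C11 sketch of the bit-sized lock idea, assuming one 64-bit word holds 64 independent try-locks. The names `bitlock_word`, `bitlock_try_acquire` and `bitlock_release` are made up for illustration, and this shows exclusive per-bit locks rather than the "named" read locks mentioned above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: one word carrying 64 single-bit locks. */
typedef _Atomic uint64_t bitlock_word;

/* Atomic OR sets the bit and returns the previous word. The lock was
   acquired only if the bit was clear before the OR. */
static bool bitlock_try_acquire(bitlock_word *w, unsigned bit)
{
    assert(bit < 64);
    uint64_t mask = UINT64_C(1) << bit;
    uint64_t old = atomic_fetch_or_explicit(w, mask, memory_order_acquire);
    return (old & mask) == 0;
}

/* Atomic AND with the inverted mask clears only this lock's bit and
   leaves the other 63 locks untouched. */
static void bitlock_release(bitlock_word *w, unsigned bit)
{
    assert(bit < 64);
    uint64_t mask = UINT64_C(1) << bit;
    atomic_fetch_and_explicit(w, ~mask, memory_order_release);
}
```

A failed `bitlock_try_acquire` still performs the OR, which is harmless here because the bit was already set; a caller would spin or back off before trying again.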

My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)

See above.

The "above" statement was not clear to me. An I/O device's
read side effect does not play nicely with the concept of
atomic. One could define the atomic not to actually "read"
the device register (no side effect), but I think one
cannot just say the operation is atomic.
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:27:53 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> posted:

On 5/27/26 10:25 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

[snip]

In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.

If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?

A more generic interface has some advantages.

I already mentioned that old software that was developed when
there was not an atomic ["expensive" operation] instruction
could benefit from idiom recognition on new hardware. (An
alternative to that would be patching or recompiling the
software. While I prefer a more abstract software distribution
format for its ability to avoid having to move things to
Architecture and even potentially perform microarchitectural
optimizations at non-instruction granularity, such seems
unlikely to be common any time soon.)

Even with atomic instructions, the Architecture generally does
not provide guarantees about scalability. I doubt any
implementation would stop-the-world to perform an atomic
operation (because the performance penalty would be quite
noticeable), but I can easily imagine an implementation
waiting until the atomic operation is not speculative before
starting it.

Understand that LOCK XADD [...] to MMI/O does exactly this !

But note: XADD [...] never causes more than necessary bus traffic
and as an atomic event, never fails, never needs retry, ...
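
For comparison at the source level, a C11 sketch of the two shapes being argued about. `add_direct` and `add_optimistic` are hypothetical names; what the compiler emits depends on the target (a single fetch-add typically becomes LOCK XADD on x86, while the weak compare-exchange loop is the portable stand-in for an LL/SC sequence):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One atomic read-modify-write: no retry loop is visible to software. */
static uint64_t add_direct(_Atomic uint64_t *p, uint64_t n)
{
    return atomic_fetch_add_explicit(p, n, memory_order_relaxed);
}

/* Optimistic version: read, compute, attempt to publish, retry on failure.
   compare_exchange_weak may also fail spuriously (for example when an
   underlying reservation is lost), so the retry count is not bounded by
   anything in the language. */
static uint64_t add_optimistic(_Atomic uint64_t *p, uint64_t n,
                               unsigned *retries)
{
    assert(retries != NULL);
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    *retries = 0;
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + n,
               memory_order_relaxed, memory_order_relaxed)) {
        ++*retries; /* `old` now holds the value that was observed */
    }
    return old;
}
```

Both return the previous value; only the second exposes failure and retry to the program.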
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:38:51 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> posted:

On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent guarantees. Using LL/SC to emulate them is a different story.

Academic LL/SC: I can agree with this statement. But neither ASF nor
ESM has problems making stronger guarantees--and I did this over
{7 ASF, 8 ESM} cache lines, not 1 single memory location. These also
impose limitations on instruction order, and SW has to understand
several non-von-Neumann properties of the ATOMIC event.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

That standard academic stuff cannot, does not mean it absolutely
cannot be done.

IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...
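
A C11 sketch of the pad-and-align step, under the loud assumption that the reservation granule is 64 bytes. `ASSUMED_GRANULE`, `padded_counter` and `same_granule` are hypothetical names, and the real granule size has to come from the vendor documentation, which is the very problem being discussed:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 64 bytes. The actual granule is implementation-defined. */
#define ASSUMED_GRANULE 64

/* Each counter is aligned to, and therefore padded out to, one granule. */
struct padded_counter {
    alignas(ASSUMED_GRANULE) _Atomic uint64_t value;
};

_Static_assert(sizeof(struct padded_counter) == ASSUMED_GRANULE,
               "one counter per assumed granule");

/* True if two addresses fall inside the same assumed granule. */
static int same_granule(const void *a, const void *b)
{
    assert(a != NULL && b != NULL);
    return (uintptr_t)a / ASSUMED_GRANULE == (uintptr_t)b / ASSUMED_GRANULE;
}
```

If the assumed size is smaller than the real granule, neighbouring counters still share a reservation and the padding buys nothing.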

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

It took me more than 35 years to learn how to write µArchitecture
documents such that a malevolent engineer could not misunderstand
what was written and specified. Try it, it is not easy. It is not
something that can be taught, but it is something that diligence
and perseverance can deliver.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have
the cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

How many SC failures are acceptable if there are 1024 cores all
going after the same lock ??

Again, I think this is more a matter of "quality of
implementation" (and Architectural guarantees about such) than
of the interface at an instruction level.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Tue Jun 2 14:42:12 2026

From Newsgroup: comp.arch

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees, even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.
| - All instructions within the transaction must be within 256
| contiguous bytes of storage.
| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).
| - All SS and SSE-format instructions may not be used.
| - Additional general instructions may not be used.
| - The transaction's storage operands may not access more than
| four octowords.
| - The transaction may not access storage operands in any
| 4K-byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.
| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.
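
As a rough illustration, the countable limits in that list can be written as a predicate. This is only a sketch in C: `txn_summary` and `fits_constrained_limits` are hypothetical names, and the instruction-format and operand-placement rules are not modeled at all:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a candidate transaction body. */
struct txn_summary {
    unsigned instructions;      /* instructions executed */
    unsigned code_bytes;        /* contiguous span of the instruction bytes */
    unsigned octowords_touched; /* distinct octowords (32-byte blocks) of operand storage */
    bool has_backward_branch;   /* any branch that is not forward-relative */
};

/* Checks only the numeric limits quoted above: at most 32 instructions,
   within 256 contiguous bytes, at most four octowords, no loops. */
static bool fits_constrained_limits(const struct txn_summary *t)
{
    assert(t != NULL);
    return t->instructions <= 32 &&
           t->code_bytes <= 256 &&
           t->octowords_touched <= 4 &&
           !t->has_backward_branch;
}
```

Passing this check says nothing about whether a real machine would accept the sequence; it just makes the size of the budget concrete.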

I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses may be very wrong about what is
practical to guarantee as atomic.

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 19:36:06 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> posted:

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.

The interconnect is going to route the package towards either a
cache having write permission or a control register.

The cache or control register will perform the packaged calculation
and optionally send back the previous value.

The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either direction.
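
A toy single-threaded C model of that message flow, with made-up names (`far_request`, `far_response`, `owner_execute`) and no claim to match any real interconnect. It assumes the owner processes one request at a time, which is what makes the operation atomic in the model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum far_op { FAR_ADD, FAR_OR, FAR_AND };

/* Message 1: the core ships operation, address and operand to the owner. */
struct far_request {
    enum far_op op;
    uint64_t *target;
    uint64_t operand;
    bool want_old; /* whether the previous value should be sent back */
};

/* Message 2: optional previous value returned to the core. */
struct far_response {
    bool has_old;
    uint64_t old;
};

/* Runs at the owner (cache with write permission or control register).
   No cache line moves; the calculation is done where the data lives. */
static struct far_response owner_execute(const struct far_request *req)
{
    assert(req != NULL && req->target != NULL);
    uint64_t old = *req->target;
    switch (req->op) {
    case FAR_ADD: *req->target = old + req->operand; break;
    case FAR_OR:  *req->target = old | req->operand; break;
    case FAR_AND: *req->target = old & req->operand; break;
    }
    struct far_response rsp = { req->want_old, req->want_old ? old : 0 };
    return rsp;
}
```

Nothing in this path can report "try again", which is the contrast with an optimistic sequence.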

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,

If so, you will be surprised when you implement one.

even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Where it becomes cubically harder.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.

Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.

I used a timer--to the same ends.

| - All instructions within the transaction must be within 256
| contiguous bytes of storage.

I allow calls to subroutines in the event.

| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).

Loops are OK as long as the timer does not go off.

| - All SS and SSE-format instructions may not be used.

Agreed.

| - Additional general instructions may not be used.

I see no reason to limit general calculations and memory access.

| - The transaction's storage operands may not access more than
| four octowords.

8 cache lines participate; an unbounded number of cache lines
can be accessed as long as the number of participants is no larger than 8.

| - The transaction may not access storage operands in any
| 4K-byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.

Interesting.

| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.

Any normal memory references to the participating lines.

I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

If you take the necessary 6 months to slog through all issues
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.

My 66000 includes priority in memory transactions, and this enables
the cache with write permission to determine to allow the request
or to fail the request (request is at equal or lower priority) thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.

At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.
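
The decision described in the last two paragraphs can be sketched as a small C function. This is a guess at the shape of the rule from the wording here, not My 66000's actual logic; `holder_state`, `verdict` and `arbitrate` are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { GRANT, NAK };

/* Hypothetical view of the cache that currently holds write permission. */
struct holder_state {
    bool in_atomic_event; /* the line participates in an ATOMIC event */
    bool can_finish;      /* the event can complete if its lines stay put */
    unsigned priority;    /* priority of the holder's event */
};

/* An incoming request wins only with strictly higher priority; a request
   at equal or lower priority fails. While the holder is in the window
   where it can finish, it NaKs so the event completes. */
static enum verdict arbitrate(const struct holder_state *h,
                              unsigned req_priority)
{
    assert(h != NULL);
    if (!h->in_atomic_event)
        return GRANT;
    if (h->can_finish)
        return NAK;
    return req_priority > h->priority ? GRANT : NAK;
}
```

The strict inequality matters: if equal priorities could displace each other, two events could keep stealing the line back and forth.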

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory coherence protocols are also insufficient.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses may be very wrong about what is
practical to guarantee as atomic.

See Lamport...

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

...
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:52:39 2026

From Newsgroup: comp.arch

On 6/1/2026 6:38 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

On 5/27/26 5:08 PM, Chris M. Thomasson wrote:

On 5/20/2026 4:47 PM, Paul Clayton wrote:

On 5/14/26 3:58 AM, Chris M. Thomasson wrote:

CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!

LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.

In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.

Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.

Academic LL/SC: I can agree with this statement. But neither ASF nor
ESM has problems making stronger guarantees--and I did this over
{7 ASF, 8 ESM} cache lines, not 1 single memory location. These also
impose limitations on instruction order, and SW has to understand
several non-von-Neumann properties of the ATOMIC event.

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

That standard academic stuff cannot, does not mean it absolutely
cannot be done.

IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.

A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)

A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...

I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.

(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)

I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.

Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o

Ugh!

Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.

It took me more than 35 years to learn how to write µArchitecture
documents such that a malevolent engineer could not misunderstand
what was written and specified. Try it, it is not easy. It is not
something that can be taught, but it is something that diligence
and perseverance can deliver.

Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have
the cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.

How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.

How many SC failures are acceptable if there are 1024 cores all
going after the same lock ??

Again, I think this is more a matter of "quality of
implementation" (and Architectural guarantees about such) than
of the interface at an instruction level.

Simple... Do NOT allow 1024 cores to hammer a single location!
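
One common way to follow that advice for a counter is to shard it. A C11 sketch, assuming 16 shards and a 64-byte line; `sharded_counter`, `sc_add` and `sc_read` are hypothetical names, and the read is a sum of relaxed loads rather than an atomic snapshot:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SC_SHARDS 16 /* ASSUMPTION: tune to core count */
#define SC_LINE   64 /* ASSUMPTION: cache line / granule size */

/* Each shard sits on its own line so writers do not collide. */
struct sc_shard {
    alignas(SC_LINE) _Atomic uint64_t n;
};

struct sharded_counter {
    struct sc_shard shard[SC_SHARDS];
};

/* Each thread updates the shard selected by its id. */
static void sc_add(struct sharded_counter *c, unsigned thread_id, uint64_t n)
{
    assert(c != NULL);
    atomic_fetch_add_explicit(&c->shard[thread_id % SC_SHARDS].n, n,
                              memory_order_relaxed);
}

/* Sums all shards. Concurrent updates may or may not be included. */
static uint64_t sc_read(struct sharded_counter *c)
{
    assert(c != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < SC_SHARDS; ++i)
        sum += atomic_load_explicit(&c->shard[i].n, memory_order_relaxed);
    return sum;
}
```

This trades a cheap, exact read for cheap writes, so it fits statistics counters far better than a lock word.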

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 14:15:24 2026

From Newsgroup: comp.arch

On 6/2/2026 12:36 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.

The interconnect is going to route the package towards either a
cache having write permission or a control register.

The cache or control register will perform the packaged calculation
and optionally send back the previous value.

The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either direction.

and as an atomic event, never fails, never needs retry, ...

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,

If so, you will be surprised when you implement one.

even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Where it becomes cubically harder.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.

Constrained transactions had these restrictions (from
https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.

I used a timer--to the same ends.

| - All instructions within the transaction must be within 256
| contiguous bytes of storage.

I allow calls to subroutines in the event.

| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).

Loops are OK as long as the timer does not go off.

| - All SS and SSE-format instructions may not be used.

Agreed.

| - Additional general instructions may not be used.

I see no reason to limit general calculations and memory access.

| - The transaction's storage operands may not access more than
| four octowords.

8 cache lines participate; an unbounded number of cache lines
can be accessed as long as the number of participants is no larger than 8.

| - The transaction may not access storage operands in any
| 4K-byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.

Interesting.

| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.

Any normal memory references to the participating lines.

I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

If you take the necessary 6 months to slog through all issues
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.

My 66000 includes priority in memory transactions, and this enables
the cache with write permission to determine to allow the request
or to fail the request (request is at equal or lower priority) thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.

At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory
coherence protocols are also insufficient.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses are very wrong about what is
practical to guarantee as atomic.

See Lamport...

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

...

Well, we can do something... we know that lock cmpxchg8b on a 32 bit
system can handle two adjacent cache lines. So, we can try to hold more
than that, but! it's not ideal. For instance my multex can do it and
emulate it. Read all https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 14:20:44 2026

From Newsgroup: comp.arch

On 6/2/2026 2:15 PM, Chris M. Thomasson wrote:

On 6/2/2026 12:36 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]

But note: XADD [...] never causes more than necessary bus traffic

I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).

The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.

The interconnect is going to route the package towards either a
cache having write permission or a control register.

The cache or control register will perform the packaged calculation
and optionally send back the previous value.

The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either
direction.

and as an atomic event, never fails, never needs retry, ...
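
The contrast drawn above, a packaged read-modify-write that cannot fail versus an optimistic sequence that may have to retry, can be sketched in C++. This is only an illustration: the function names are hypothetical, and the mapping to LOCK XADD or to an LL/SC loop is what current compilers typically emit, not something the C++ standard or this thread guarantees.

```cpp
#include <atomic>
#include <cstdint>

// Single-operation form: one fetch-and-add request. On x86-64 compilers
// typically emit LOCK XADD for this; there is no retry loop in the code.
inline std::uint64_t add_direct(std::atomic<std::uint64_t>& c) {
    return c.fetch_add(1, std::memory_order_relaxed);
}

// Optimistic form: read, compute, try to publish, retry on interference.
// This is the shape an LL/SC sequence takes when written in portable C++.
inline std::uint64_t add_optimistic(std::atomic<std::uint64_t>& c,
                                    std::uint64_t& retries) {
    std::uint64_t old = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old, old + 1,
                                    std::memory_order_relaxed,
                                    std::memory_order_relaxed)) {
        ++retries;  // the failed attempt refreshed 'old' with the current value
    }
    return old;     // previous value, like fetch_add
}
```

Both functions return the previous value. Only the second one contains a loop, and under contention its retry count can become nonzero; a weak compare-exchange is also allowed to fail spuriously.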

I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,

If so, you will be surprised when you implement one.

                                      even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.

Where it becomes cubically harder.

Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.

SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.

Constrained transactions had these restrictions (from
https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.

I used a timer--to the same ends.

| - All instructions within the transaction must be within 256
|   contiguous bytes of storage.

I allow calls to subroutines in the event.

| - The only branches you may use are relative branches that
|   branch forward (so there can be no loops).

Loops are OK as long as the timer does not go off.

| - All SS and SSE-format instructions may not be used.

Agreed.

| - Additional general instructions may not be used.

I see no reason to limit general calculations and memory access.

| - The transaction's storage operands may not access more than
|   four octowords.

8 cache lines participate, an unbounded number of cache lines
can be accessed as long as participants is no larger than 8.

| - The transaction may not access storage operands in any
|   4K-byte blocks that contain the 256 bytes of storage beginning
|   with the TBEGINC instruction.

Interesting.

| - Operand references must be within a single doubleword,
|   except for some of the "multiple" instructions for which the
|   limitation is a single octoword.

Any normal memory references to the participating lines.

I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???

I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)

If you take the necessary 6 months to slug through all issues
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).

I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.

If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.

There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.

The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.

My 66000 includes priority in memory transactions, and this enables
the cache with write permission to determine to allow the request
or to fail the request (request is at equal or lower priority) thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.

At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.

With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.

It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory
coherence protocols are also insufficient.

I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses are very wrong about what is
practical to guarantee as atomic.

See Lamport...

Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.

...

Well, we can do something... we know that lock cmpxchg8b on a 32 bit
system can handle two adjacent cache lines. So, we can try to hold more
than that, but! it's not ideal. For instance my multex can do it and
emulate it. Read all https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

I think that is why AMD allowed for LOCK RMW along with LL/SC?!
--- Synchronet 3.22a-Linux NewsLink 1.2

From Andy Valencia@[email protected] to comp.arch on Tue Jun 2 17:11:11 2026

From Newsgroup: comp.arch

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

Now, there was no thought of hundreds (or thousands) of CPU's. But
some of the pessimistic assumptions you might make of LL/SC (at least
as available in MIPS CPU's of that era) might need to be
revisited. Our best analysis said it would scale to very large
(for that time) database workloads.

Finances and other management things cancelled the program. Sequent
eventually went with their NUMA, ultimately being acquired by IBM. We
never found out how that system would've done in the real world.

I seem to remember its code name was "Model R" (RISC).

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Wed Jun 3 18:19:28 2026

From Newsgroup: comp.arch

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

   !@   +!@
  7.5   7.3  not atomic
 14.2  13.2  atomic

On a Xeon E-2388G (Rocket Lake):

   !@   +!@
  8.5   7.1  not atomic
 25.8  26.6  atomic

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 12:57:42 2026

From Newsgroup: comp.arch

On 6/3/2026 11:19 AM, Anton Ertl wrote:

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW. It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Wed Jun 3 20:53:49 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 15:15:53 2026

From Newsgroup: comp.arch

On 6/3/2026 1:53 PM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

Humm... I missed that. Anyway, you need to test them multi threaded...
Say our counters are per thread so an increment adds to its per-thread
counter instead of using a LOCK RMW. Then when the counter needs to be
sampled we can start summing up the per thread counts...

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 3 15:23:43 2026

From Newsgroup: comp.arch

On 6/3/2026 3:15 PM, Chris M. Thomasson wrote:

On 6/3/2026 1:53 PM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
      1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
      1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
      1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
      1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

   !@   +!@
   7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

   !@   +!@
   8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

Humm... I missed that. Anyway, you need to test them multi threaded...
Say our counters are per thread so an increment adds to its per-thread
counter instead of using a LOCK RMW. Then when the counter needs to be
sampled we can start summing up the per thread counts...

It can be amortized in different ways. Per thread is pretty damn lean
and mean! ;^) Or we can have some tables of counters aligned and padded.
So, a thread can increment its assigned counter instead of its
per-thread count, or vice versa. But, the idea is to distribute things
so a shit load of threads are not hammering a single location.

It depends on the type of data or what the counters are being used for.
We can read them using std::memory_order_relaxed loads.

Thread 1: [ Counter A ] ---> Relaxed Increment (No LOCK)
Thread 2: [ Counter B ] ---> Relaxed Increment (No LOCK)
Thread 3: [ Counter C ] ---> Relaxed Increment (No LOCK)
                                    ^
Sampling Thread: -------------------+ (Loops through with relaxed loads)
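
The layout in the diagram can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not code from the thread: the names `Slot` and `ShardedCounter` are hypothetical, a 64-byte cache line is assumed, and each slot is assumed to have exactly one writer thread, which is what lets the increment avoid a locked read-modify-write.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines; adjust for the target machine.
constexpr std::size_t kCacheLine = 64;

// One counter per thread, aligned (and thereby padded) to a full line so
// that no two slots share a cache line.
struct alignas(kCacheLine) Slot {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(Slot) == kCacheLine, "each slot fills one line");

template <std::size_t N>
class ShardedCounter {
public:
    // Must be called only by the thread that owns slot i (single writer),
    // so a relaxed load plus a relaxed store is enough; no atomic RMW.
    void increment(std::size_t i) {
        auto& v = slots_[i].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // Sampling thread: loop through the slots with relaxed loads. The sum
    // is a snapshot and may lag behind writers that are still running.
    std::uint64_t sample() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::array<Slot, N> slots_{};
};
```

In use, thread k would call `increment(k)` on a shared `ShardedCounter<N>` and a separate thread would call `sample()` whenever the total is needed. If more than one thread may write the same slot, the increment has to become a `fetch_add` again.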
--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@[email protected] (Scott Lurndal) to comp.arch on Thu Jun 4 14:21:16 2026

From Newsgroup: comp.arch

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using Motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.
--- Synchronet 3.22a-Linux NewsLink 1.2

From EricP@[email protected] to comp.arch on Thu Jun 4 10:23:36 2026

From Newsgroup: comp.arch

On 2026-Jun-03 14:19, Anton Ertl wrote:

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

Let's see:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

- anton

On x86/x64 the exchange instruction XCHG with a memory operand has an
implied LOCK prefix whether it is specified or not. In your example both
are atomic.
CMPXCHG does not do this - to be atomic it must have a LOCK prefix.
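
For readers who think in C++ rather than assembly, the two instructions correspond roughly to the operations below. This is a sketch with hypothetical function names; the comments describe what x86-64 compilers typically emit, which is an observation about common toolchains and not a language guarantee.

```cpp
#include <atomic>
#include <cstdint>

// Unconditional swap. On x86-64 this is typically compiled to XCHG with a
// memory operand, which is locked whether or not a LOCK prefix is written.
inline std::uint64_t swap_in(std::atomic<std::uint64_t>& cell,
                             std::uint64_t v) {
    return cell.exchange(v, std::memory_order_seq_cst);
}

// Conditional swap. On x86-64 this is typically compiled to LOCK CMPXCHG;
// here the prefix has to be present for the operation to be atomic.
inline bool swap_if(std::atomic<std::uint64_t>& cell,
                    std::uint64_t expected, std::uint64_t v) {
    return cell.compare_exchange_strong(expected, v,
                                        std::memory_order_seq_cst);
}
```

`swap_in` always stores and returns the old value; `swap_if` stores only when the cell still holds `expected` and reports whether it did.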

--- Synchronet 3.22a-Linux NewsLink 1.2

From EricP@[email protected] to comp.arch on Thu Jun 4 10:25:06 2026

From Newsgroup: comp.arch

On 2026-Jun-03 16:53, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 6/3/2026 11:19 AM, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.

It's two locations in these benchmarks: X and Y.

It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?

These benchmarks use per-thread storage: They are single-threaded.

- anton

They might be allocated in the same cache line.

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Thu Jun 4 21:04:28 2026

From Newsgroup: comp.arch

EricP <[email protected]> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 18:28:43 2026

From Newsgroup: comp.arch

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using Motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 18:33:41 2026

From Newsgroup: comp.arch

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <[email protected]> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

How is your data organized? Show me the structure?
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 21:20:20 2026

From Newsgroup: comp.arch

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <[email protected]> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)

while the code for "x atomic!@" is:

mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13

As you can see, there is no XCHG in the !@ code.

XCHG does have the implied LOCK as EricP mentioned.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 4 22:56:47 2026

From Newsgroup: comp.arch

On 6/4/2026 6:33 PM, Chris M. Thomasson wrote:

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <[email protected]> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

: bench-!@
      1 5000000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
      1 5000000 0 do x atomic!@ y atomic!@ loop drop ;

...

On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.

The code for "x !@" is:

mov    0x8(%rbx),%r15
mov    %r13,%rax
mov    (%r15),%r13
mov    %rax,(%r15)

while the code for "x atomic!@" is:

mov    %r13,(%r10)
sub    $0x8,%r10
mov    0x8(%rbx),%r13
mov    0x8(%r10),%rax
add    $0x8,%r10
xchg   %rax,0x0(%r13)
mov    %rax,%r13

As you can see, there is no XCHG in the !@ code.

How is your data organized? Show me the structure?

// padded to an L2 cache line
struct A
{
    unsigned word m_data;
    char padding[...];
};

// padded to an L2 cache line
struct B
{
    unsigned word m_data;
    char padding[...];
};

Where A and B are both aligned up to an L2 cache line boundary? We need
to pad _and_ align...
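
One way to get both properties in C++11 and later is `alignas`, sketched below. The 64-byte line size is an assumption (the thread does not state one), `std::uintptr_t` stands in for the unspecified `word` type, and `different_lines` is a hypothetical helper added only to make the property checkable.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kLine = 64;

// alignas aligns the start of the object and also rounds sizeof up to a
// multiple of the alignment, so no explicit padding array is needed.
struct alignas(kLine) A { std::uintptr_t m_data; };
struct alignas(kLine) B { std::uintptr_t m_data; };

static_assert(alignof(A) == kLine && sizeof(A) == kLine, "A fills one line");
static_assert(alignof(B) == kLine && sizeof(B) == kLine, "B fills one line");

// Hypothetical helper: true when two addresses fall in different lines.
inline bool different_lines(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kLine !=
           reinterpret_cast<std::uintptr_t>(q) / kLine;
}
```

Padding without alignment is not enough: a 64-byte object that starts in the middle of a line straddles two lines and can still share one of them with a neighbour.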
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 07:04:17 2026

From Newsgroup: comp.arch

[email protected] (Anton Ertl) writes:

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.
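
In C++ terms the barrier test has roughly the shape below. This is a sketch and not the Gforth implementation: `bench_x`, `bench_y` and `load_with_barriers` are hypothetical names, and `std::atomic_thread_fence(std::memory_order_seq_cst)` is used as the C++ spelling of the GNU C builtin quoted above.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the benchmark's two variables.
std::atomic<std::int64_t> bench_x{1};
std::atomic<std::int64_t> bench_y{-1};

// Two loads, each followed by a full fence, mirroring
// "x @ barrier y @ barrier" in the Forth source.
inline std::int64_t load_with_barriers() {
    std::int64_t a = bench_x.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    std::int64_t b = bench_y.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return a + b;
}
```

The fence orders the two loads with respect to each other and to surrounding accesses; which instruction it becomes depends on the target and compiler.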

I have increased the loop count by a factor of 10, because I did not
subtract the startup overhead of Gforth; as a result, the startup
overhead is reduced from 3.3 cycles per execution of the relevant word
to 0.33 cycles.

I have also inserted 64 bytes between the variables, so that they are
in different cache lines. This should not make a difference, because
all accesses are in the same thread (i.e., no cache-ping-pong from
possible false sharing), but just in case.

What I did not do is to use several threads. The idea here is that
programmers will take measures that ensure that contention is rare,
but you still need to use atomic instructions and barriers to ensure
correctness. Ideally in this case the atomic instructions and
barriers have no extra cost, but in reality, they do have extra cost.
If you are interested in seeing data for the contended case, look at
the cache ping-pong benchmarks, e.g., on chipsandcheese. There is one
danger in my approach: Hardware could have a special optimization for
memory that is not shared between threads at all, and run slower if
the memory is shared, but not contended; I have never read about such
a mechanism, and I'll leave checking the performance with multiple
non-contending threads for another day.

The source code now is:

variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

: bench-!@
1 50_000_000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 50_000_000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

: bench-nobarrier
50_000_000 0 do x @ y @ 2drop loop ;

: bench-barrier
50_000_000 0 do x @ barrier y @ barrier 2drop loop ;

The results are:

Ryzen 8700G (Zen4):
   !@   +!@  barr
  2.4   2.4   1.8  no atomic/no barrier
  9.2   8.3   7.1  atomic/barrier

Ryzen 3900X (Zen2; in contrast to the 8700G with 1 CCX, the 3900X has
4 CCXs that may need coordination):
   !@   +!@  barr
  2.9   4.5   2.2  no atomic/no barrier
 19.1  19.0  17.5  atomic/barrier

Given that the cycles here are far below the cycles reported for
inter-CCX cache ping-pong, I guess that there is no inter-CCX
communication (at least no bidirectional one) in this benchmark.

On to Intel:
Core i3-1315U P-core (Golden Cove):
   !@   +!@  barr
  1.9   1.9   1.5  no atomic/no barrier
 19.4  20.9  27.9  atomic/barrier

Core i3-1315U E-core (Gracemont):
   !@   +!@  barr
  2.7   2.2   2.2  no atomic/no barrier
 20.6  20.4  20.0  atomic/barrier

On to Apple Silicon (weak memory ordering by default):
Apple M1 P-core (Firestorm):
   !@   +!@  barr
  3.6   3.6   3.5  no atomic/no barrier
 31.9  31.5   3.6  atomic/barrier

Apple M1 E-core (Icestorm):
   !@   +!@  barr
  3.4   3.4   3.4  no atomic/no barrier
 31.4  32.9   6.9  atomic/barrier

On to ARM (weak memory ordering):
RK3588 big (Cortex-A76):
   !@   +!@  barr
  3.3   3.6   3.3  no atomic/no barrier
 20.3  20.4  13.2  atomic/barrier

RK3588 little (Cortex-A55):
   !@   +!@  barr
  7.2   9.2   7.2  no atomic/no barrier
 68.1  57.1  16.2  atomic/barrier

I find the cheapness of the barrier on the M1 surprising. I would
have expected that barriers are more expensive on hardware where the
architecture allows more reordering and the hardware makes use of that
license (and I think that the M1 does make use of it).

OTOH, the atomic stuff is more expensive on the Apple M1 and the ARM
cores than on the Intel and AMD cores (note that the cycle times of
the Intel and AMD cores used here are quite a bit shorter than for
Apple and ARM cores, except for Gracemont compared to Firestorm; but
for Firestorm the number of cycles executed is higher, so Gracemont
still takes less time).

In conclusion, as long as we have no contention, atomic accesses and
barriers do not cost hundreds of cycles, but they do cost enough extra
(except the barrier on Firestorm, at least in the present benchmark)
that one does not want to use them across the board, only when
accessing memory that another thread accesses, too. At least in this
sample of cores, the atomic instructions are faster on Intel and AMD
cores than on Apple and ARM cores; for the barrier, the costs are
usually not higher, and sometimes significantly lower, than for the
atomic instructions.

- anton

On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):

!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic

On a Xeon E-2388G (Rocket Lake):

!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:04:51 2026

From Newsgroup: comp.arch

[email protected] (Scott Lurndal) writes:

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC.

I remember listening to a presentation by a student of a colleague
about implementing garbage collection for IIRC big SGI machines. In
addition to LL/SC, they had atomic stuff such as fetch-and-add
implemented in the memory subsystem, not in the processor, and that
apparently was needed for contended cases to avoid the round-trip time
through the caches of individual processors. My understanding is
that, while viewed from the perspective of an individual core, the
atomic instructions were slow, the throughput in the contended case
was significantly higher than with LL/SC or an atomic mechanism
implemented in the individual CPUs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:12:03 2026

From Newsgroup: comp.arch

EricP <[email protected]> writes:

These benchmarks use per-thread storage: They are single-threaded.

...

They might be allocated in the same cache line.

Given that they are accessed by the same thread, I don't expect that
to hurt, but I did separate the variables by at least 64 bytes in my
recent runs just in case.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 09:14:29 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/4/2026 2:04 PM, Anton Ertl wrote:

EricP <[email protected]> writes:

On 2026-Jun-03 14:19, Anton Ertl wrote:

variable x 1 x !
variable y -1 y !

...

How is your data organized? Show me the structure?

Shown above. Or, in today's testing:

variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Fri Jun 5 10:20:30 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@[email protected] (Scott Lurndal) to comp.arch on Fri Jun 5 13:43:11 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.

Both technical and business reasons.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Michael S@[email protected] to comp.arch on Fri Jun 5 17:02:24 2026

From Newsgroup: comp.arch

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.
Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and store
operations - they behave as RMO regardless of the processor's
global memory mode.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Andy Valencia@[email protected] to comp.arch on Fri Jun 5 07:07:07 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.

Sun came through Cisco as well, I don't recall which generation of
chips, but I remember their focus was on the interface to memory
itself, targeting radically reduced latency and much higher bandwidth.
We weren't sure they would get their design out the door, and we were
pretty sure indeed that they wouldn't make a good enough embedded
CPU for our purposes. Too big, too hot, too expensive, and so forth.

At that time (MANY years ago now) Cisco's core router OS was big endian
only. That kept us from considering x86.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 14:57:46 2026

From Newsgroup: comp.arch

On 6/5/2026 2:12 AM, Anton Ertl wrote:

EricP <[email protected]> writes:

These benchmarks use per-thread storage: They are single-threaded.

...

They might be allocated in the same cache line.

Given that they are accessed by the same thread, I don't expect that
to hurt, but I did separate the variables by at least 64 bytes in my
recent runs just in case.

Make sure to pad and align the variables on separate cache lines. :^)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:11:22 2026

From Newsgroup: comp.arch

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Well, it's mainly for false sharing in a multithreading environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.

Single-threaded: avoid straddling cache line boundaries to prevent bus
locks on LOCK prefixed instructions.

Multi-threaded: pad and align to prevent false sharing between
independently accessed variables.

For instance you don't want a mutex word to false share with say an
atomic counter that has nothing to do with the mutex. They need to be
padded and aligned...

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:27:04 2026

From Newsgroup: comp.arch

On 6/5/2026 12:04 AM, Anton Ertl wrote:

[email protected] (Anton Ertl) writes:

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

I have increased the loop count by a factor of 10, because I did not
subtract the startup overhead of Gforth; as a result, the startup
overhead is reduced from 3.3 cycles per execution of the relevant word
to 0.33 cycles.

I have also inserted 64 bytes between the variables, so that they are
in different cache lines. This should not make a difference, because
all accesses are in the same thread (i.e., no cache-ping-pong from
possible false sharing), but just in case.

What I did not do is to use several threads. The idea here is that
programmers will take measures that ensure that contention is rare,
but you still need to use atomic instructions and barriers to ensure
correctness. Ideally in this case the atomic instructions and
barriers have no extra cost, but in reality, they do have extra cost.

Indeed.

[snip results]

Thanks.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:40:13 2026

From Newsgroup: comp.arch

On 6/5/2026 12:04 AM, Anton Ertl wrote:

[email protected] (Anton Ertl) writes:

Paul Clayton <[email protected]> writes:

I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.

I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

[...]

On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
per thread stack location? Iirc some compilers would use a dummy. Oh
shit man, 20+ish years ago I was running all sorts of benchmarks on
MFENCE vs LOCK RMW. Or MFENCE vs MEMBAR #StoreLoad | #LoadStore |
#StoreStore | #LoadLoad on the SPARC. I could not really directly test
LOCK RMW wrt x86 on the SPARC because all of the SPARC's atomic RMWs are
naked. I would have to manually add the barriers to make it TSO in RMO mode.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 15:43:14 2026

From Newsgroup: comp.arch

On 6/5/2026 3:11 PM, Chris M. Thomasson wrote:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

// padded to a l2 cache line
struct A
{
     unsigned word m_data;
     char padding[...];
};

// padded to a l2 cache line
struct B
{
     unsigned word m_data;
     char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

   !@   +!@ barr
   2.4 2.4 1.8 A B C
   2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

   !@   +!@ barr
   9.3 8.3 7.2 A
   9.2 8.3 7.1 B
   9.2 8.3 8.5-11.2 C
   9.3 8.3 9.1-11   D
   9.1 8.3 7.3-11   E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.

Well, it's mainly for false sharing in a multithreading environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock. Single-threaded: avoid straddling cache line
boundaries to prevent bus locks on LOCK prefixed instructions.

Actually try to avoid LOCK prefixed anything on single threaded... Even
XCHG has that implied LOCK prefix. :^)

Multi-threaded: pad and align to prevent false sharing between
independently accessed variables.

For instance you don't want a mutex word to false share with say an
atomic counter that has nothing to do with the mutex. They need to be
padded and aligned...

Here's the code:

1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
     : cache-align here dup 64 naligned >align ;
     cache-align
     here 1 , cache-align here -1 , constant y constant x
[endif]

The part before the [else] is A, comment out "64 allot" for B.

The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:06:43 2026

From Newsgroup: comp.arch

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed up.

Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some sparcs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and store
operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:08:46 2026

From Newsgroup: comp.arch

On 6/5/2026 4:06 PM, Chris M. Thomasson wrote:

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed
up.

Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some sparcs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and store
operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

I would always program the sparc (ASM using GAS) using the correct
membars in the right places even if in certain modes they would be no-ops.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 5 16:17:05 2026

From Newsgroup: comp.arch

On 6/5/2026 4:06 PM, Chris M. Thomasson wrote:
[...]

Fwiw, the SunFire T2000 was the first sparc box I owned personally. Sun
gave me one in their CoolThreads contest for my vzoom project. I
have used others before that, but they were not mine.
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Sat Jun 6 01:44:09 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed up.

Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some sparcs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and store
operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

SPARC used nullification in delay slots.

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:14:17 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/5/2026 12:04 AM, Anton Ertl wrote:

[email protected] (Anton Ertl) writes:
I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as

__atomic_thread_fence(__ATOMIC_SEQ_CST);

The barriers separate loads.

[...]

On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
per thread stack location?

On AMD64, the latter. The code generated by gcc for the line above
is:

lock orq $0x0,(%rsp)

On ARM A64 gcc generates the following:

dmb ish

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:30:45 2026

From Newsgroup: comp.arch

[email protected] (Anton Ertl) writes:

Anyway, let's see if it makes a difference.

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

[...]

And here are the results (on a Ryzen 8700G):

The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:

!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E

For the atomic/barrier variants the cycles are:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E

The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that.

Now I have: It's the placement of the native code. If I compile
another definition that is never called,

: dummy1 swap over 2rot ;

before all the others, the result for D becomes:

!@ +!@ barr
9.3 8.3 7.2 D

with little variation. So it seems that the code placement of the
bench-barrier word ran into some microarchitectural hiccup of Zen4.

Now that I have that problem worked around, let's see if the data
placement makes a difference:

!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.3 8.3 7.0 C
9.3 8.3 7.2 D
9.3 8.3 7.2 E

Making them adjacent in the same cache line is not a disadvantage as
long as there is no actual communication going on. Of course, in an
actual application you want them in different cache lines, because
then you will have communication; otherwise using atomic accesses or
barriers would be pointless.

Code (with the data part set up for E):

0 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !

[else]
: dummy1 swap over 2rot ;
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , ( cache-align ) 64 allot here -1 , constant y constant x
[endif]

: bench-!@
1 50_000_000 0 do x !@ y !@ loop drop ;

: bench-atomic!@
1 50_000_000 0 do x atomic!@ y atomic!@ loop drop ;

: bench-+!@
1 50_000_000 0 do x +!@ y +!@ loop drop ;

: bench-atomic+!@
1 50_000_000 0 do x atomic+!@ y atomic+!@ loop drop ;

: bench-nobarrier
50_000_000 0 do x @ y @ 2drop loop ;

: bench-barrier
50_000_000 0 do x @ barrier y @ barrier 2drop loop ;

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@[email protected] (Anton Ertl) to comp.arch on Sat Jun 6 08:49:06 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> writes:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

[...]

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

...

Well, it's mainly for false sharing in a multi-threading environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.

All of the data placement variants use word-aligned words and thus do
not straddle cache lines. But your claim was that one should use only
the first word in a cache line. Avoiding false sharing is important,
if there is any sharing, but that's not the case for this benchmark.
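
The straddling claim is easy to check mechanically. The following is only a sketch, assuming 64-byte cache lines and a nonzero access size; the function name straddles is invented here and does not come from the benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache-line size in bytes */

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "line size is assumed to be a power of two");

/* 1 if the access [addr, addr + size) crosses a cache-line boundary,
   0 otherwise. `size` must be at least 1. */
static int straddles(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE;
    return first_line != last_line;
}
```

An 8-byte word at any 8-byte-aligned address gives 0 here, because 64 is a multiple of 8; an 8-byte access at address 60 gives 1.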

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 11:25:17 2026

From Newsgroup: comp.arch

On 6/5/2026 6:44 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes
mixed up.
Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some SPARCs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are
documented to be TSO-only. The one processor for which I didn't find a
definite statement is the original UltraSPARC III (Cheetah), but I
would be very surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and
store operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

SPARC used nullification in delay slots.

Iirc, might be wrong here, a MEMBAR can force processor serialization or
stall the pipeline until the store buffers drain, so executing it right
when the processor is updating the PC and nPC for a branch created nasty
timing hazards? God, it's been a long time since I read the docs...
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 11:52:09 2026

From Newsgroup: comp.arch

On 6/6/2026 1:49 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 6/5/2026 3:20 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};

// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};

Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...

Why would alignment to cache-line boundaries be necessary?

[...]

A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).

B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).

C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).

D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.

E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).

F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).

...

Well, it's mainly for false sharing in a multi-threading environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.

All of the data placement variants use word-aligned words and thus do
not straddle cache lines. But your claim was that one should use only
the first word in a cache line. Avoiding false sharing is important,
if there is any sharing, but that's not the case for this benchmark.

Fair enough! :^) For a single-threaded benchmark with no concurrent
sharing, you are right. The layout variants you described ensure no
single word straddles a cache-line boundary, which completely avoids the
split-access or bus-lock penalty on a single core. In that specific
context, packing things tightly is "superior" because my defensive
padding would just bloat the working set and cause unnecessary cache
misses.

Fwiw, my advice to align and pad so a variable exclusively owns the
first word of a cache line is a habit born entirely out of
multi-threaded, lock/wait-free architecture design.

Actually, there is a fundamental difference in intent:

1. Word Alignment: Keeps a single thread from paying split-access
penalties (straddling). No word bleeds from cache line A into cache
line B.

2. Cache-Line Alignment + Padding: Keeps different threads on different
cores from causing hardware cache-coherence storms (false sharing). Very
bad!

If struct A and struct B live in the exact same cache line, they are
safe from straddling. But the moment Core 0 writes to A and Core 1
writes to B, the underlying MESI cache-coherence protocol will violently
bounce that single cache line back and forth between L1 caches.

Since your benchmark doesn't have concurrent sharing, you only care
about #1. I default to engineering for #2 defensively because the moment
code scales out to multiple threads, a well-aligned but unpadded
structure can cause performance to fall off a cliff.
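
To make "pad _and_ align" concrete, here is a minimal C11 sketch. It assumes a 64-byte cache line (check your target), and the names padded_word and shared_pair are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* One word that owns a whole cache line: alignas puts it at the start
   of a line, the explicit padding fills the rest of that line. */
struct padded_word {
    alignas(CACHE_LINE) unsigned long data;
    char padding[CACHE_LINE - sizeof(unsigned long)];
};

static_assert(sizeof(struct padded_word) == CACHE_LINE,
              "padded_word must occupy exactly one cache line");
static_assert(alignof(struct padded_word) == CACHE_LINE,
              "padded_word must start on a cache-line boundary");

/* Two independently written words, e.g. one per thread; with the layout
   above they can never end up in the same cache line. */
struct shared_pair {
    struct padded_word a;
    struct padded_word b;
};
```

With this layout, offsetof(struct shared_pair, b) is 64, so a write to a and a write to b always touch different lines. Dropping the alignas keeps the size but lets the pair start mid-line, which is why padding alone is not enough.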

Actually, do you remember the thread offset fiasco from Intel? I
remember reading a white paper wrt Hyper-Threading, saying that the
thread stacks should be offset from each other to avoid false sharing.
It was a workaround for a design error, iirc?
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 12:03:46 2026

From Newsgroup: comp.arch

On 6/6/2026 11:25 AM, Chris M. Thomasson wrote:

On 6/5/2026 6:44 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes
mixed up.

Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some SPARCs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are
documented to be TSO-only. The one processor for which I didn't find a
definite statement is the original UltraSPARC III (Cheetah), but I
would be very surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and
store operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

SPARC used nullification in delay slots.

Iirc, might be wrong here, a MEMBAR can force processor serialization or
stall the pipeline until the store buffers drain, so executing it right
when the processor is updating the PC and nPC for a branch created nasty
timing hazards? God, it's been a long time since I read the docs...

Or iirc, sometimes in certain use cases, the branch delay slot might not
be executed? Even when programming it directly in ASM and using GAS to
assemble it?
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sat Jun 6 12:08:33 2026

From Newsgroup: comp.arch

On 6/6/2026 12:03 PM, Chris M. Thomasson wrote:

On 6/6/2026 11:25 AM, Chris M. Thomasson wrote:

On 6/5/2026 6:44 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/5/2026 7:02 AM, Michael S wrote:

On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:

On 6/4/2026 7:21 AM, Scott Lurndal wrote:

Andy Valencia <[email protected]> writes:

I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.

I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.

I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).

I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.

Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.

RMO mode?
I am pretty sure that T2000 had no RMO mode.

If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.

Oh shit, I think you are right! I sometimes get my old SPARC boxes
mixed up.

Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.

It still needed an explicit membar for a store followed by a load to
another location, even in TSO.

Actually, I forgot how I got some SPARCs in RMO mode. PSTATE?

Starting from UltraSPARC III Cu, all Sun SPARC processors are
documented to be TSO-only. The one processor for which I didn't find a
definite statement is the original UltraSPARC III (Cheetah), but I
would be very surprised if it is not the same as UltraSPARC III Cu.

SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnants of RMO in these processors are the block load and
store operations - they behave as RMO regardless of the processor's
global memory mode.

Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?

SPARC used nullification in delay slots.

Iirc, might be wrong here, a MEMBAR can force processor serialization or
stall the pipeline until the store buffers drain, so executing it right
when the processor is updating the PC and nPC for a branch created nasty
timing hazards? God, it's been a long time since I read the docs...

Or iirc, sometimes in certain use cases, the branch delay slot might not
be executed? Even when programming it directly in ASM and using GAS to
assemble it?

Hyper dangerous case. If a MEMBAR instruction is "skipped", then another
one bites the dust! Memory racer!

Fwiw, some tech relief, a song to go with it:

(Queen - Another One Bites The Dust (Official Video))

https://youtu.be/eqyUAtzS_6M?list=RDeqyUAtzS_6M

;^D

Memory race... A song for it.. rofl!

(Charli XCX - Speed Drive (From Barbie The Album) [Official Audio])

https://youtu.be/TxZwCpgxttQ?list=RDTxZwCpgxttQ

Sorry, just a brain coolant. ;^)

--- Synchronet 3.22a-Linux NewsLink 1.2
