Paul Clayton <[email protected]> posted:
On 5/11/26 10:38 AM, Scott Lurndal wrote:
For HW to 'recognize' a LL-OP-SC as an idiom, it would have to be
'local'. 'Local' probably means that there are none-to-few instructions
that are not part of the LL-OP-SC sequence (with a strong preference
for none).
That is: HW can recognize:
LDDL R9,[IP,#48]
ADD R10,R9,#1
STDL R10,[IP,#48] // same address "pattern"
as an idiom fairly easily, but not:
MOV R8,#48
MOV R6,#48
...
LDDL R9,[IP,R6]
ADD R10,R9,#1
STDL R10,[IP,R8] // same address "pattern" ???
Said recognized idiom can be packaged up and shipped out through
the memory hierarchy as an XADD without much trouble.
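To make the two shapes concrete, here is a minimal C++ sketch (not from the thread) of the same fetch-and-add written once as a retry loop and once as a single atomic operation. The function names are hypothetical, and the claim that a compiler lowers the loop to LL/ADD/SC on an LL/SC machine is an assumption about typical code generation, not a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Retry-loop form. ASSUMPTION: on an LL/SC machine a compiler typically
// lowers this to an LL / ADD / SC loop; that is not guaranteed.
inline std::uint64_t add_via_retry_loop(std::atomic<std::uint64_t>& word,
                                        std::uint64_t delta) {
    std::uint64_t seen = word.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, much like an SC. On
    // failure it reloads `seen` with the current value, so we just retry.
    while (!word.compare_exchange_weak(seen, seen + delta,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
    }
    return seen;  // value observed before the add
}

// Single-operation form: the whole read-modify-write is one request that
// an implementation may execute as one atomic instruction or transaction.
inline std::uint64_t add_via_single_op(std::atomic<std::uint64_t>& word,
                                       std::uint64_t delta) {
    return word.fetch_add(delta, std::memory_order_acq_rel);
}
```

Both functions return the old value and leave the same result in memory; the difference under discussion is only in what the hardware is told and when.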
Please elaborate. There are few restrictions on the instructions
that lie between the LL and SC instructions - I don't see how
any CPU could translate an arbitrary sequence of instructions
between the LL and SC into an atomic bus operation efficiently.
The hardware implementation can choose which LL/SC-guarded
operations to export,
No, no, no; HW exports LL and SC and SW uses them as it sees fit.
The only choice HW has is whether LL and SC are user instructions.
which to optimize into a fast path within
the processor, and which to treat conventionally. Even in a more
conventional implementation, NAKs or deferred responses might be
used to promote forward progress.
This does require software developers to monitor what
optimizations are implemented, at least if there are
alternatives with possibly more desired performance
characteristics.
Unworkable.
Even with atomic instructions, I get the impression that the
explicit implementation (performance/scaling) is not
architecturally defined.
Scaling is not an Architectural property; it is an implementation
property.
An atomic instruction might be
implemented with LL/SC with a guarantee of eventual success
(which would hopefully not be as bad as some x86 global lock for
cache block crossing LOCKed instructions).
You might be surprised at how glacial that eventual success is.
(AArch64's STADD does not guarantee that the addition will be
done in the cache hierarchy even on a cache miss. The
architecture merely guarantees that the operation will be
atomic. An implementation could optimistically use an LL/SC-
based mechanism and fall back to locking rather than just
monitoring the reservation to ensure forward progress. With
out-of-order execution, the actual store to shared memory has
to be delayed until it is no longer speculative anyway,
replaying an atomic operation can be faster than a branch
misprediction — and even a branch misprediction can be fast
compared to communication between caches.)
IMO, LL/SC is an obsolete artifact of the past.
You, I, and Chris seem to agree on this detail.
I disagree. I _feel_ LL/SC is a nice abstract interface that
not only allows high-performance implementations of simple
atomics without requiring new software but can also (in theory)
be extended to multiple reservations (like My 66000's ESM) and
even to very general transactional memory. (I think a better
interface is possible with easier decode, better code density,
and the opportunity for hints and/or directives, but such would
introduce other costs.)
I see specific atomic operations as somewhat attractive (idiom
recognition is nice but it is not free), but potentially
susceptible to an excessive expansion of instructions. (SIMD
has similar tradeoffs. I like SIMD, but it has issues.)
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
CAS failures, I have tested this in the past, will hit the bus
lock and still make forward progress... Sigh... A horrible LL/SC
thing can live lock!
Paul Clayton <[email protected]> writes:
On 5/11/26 10:38 AM, Scott Lurndal wrote:
IME, atomic operations at the instruction set level have
not been implemented with LL/SC, even on architectures
that have LL/SC (or LDEX/STREX). The typical atomic
operations are designed so that they can generate
atomic PCIe (or other on-chip) transactions which cannot be
simulated using LL/SC.
We'll have to agree to disagree. I consider the lack of scalability
of LL/SC to be a fatal defect.
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead). Normal loads and stores within the code body would
be "guarded" and the SC could have a different address than the
LL. I.e., forward compatibility would be possible without adding
any Architectural state or new instructions while providing new functionality.
On 5/14/26 11:03 AM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
On 5/11/26 10:38 AM, Scott Lurndal wrote:
[snip]
IME, atomic operations at the instruction set level have
not been implemented with LL/SC, even on architectures
that have LL/SC (or LDEX/STREX). The typical atomic
operations are designed so that they can generate
atomic PCIe (or other on-chip) transactions which cannot be
simulated using LL/SC.
I seem to recall reading Andy Glew mentioning that an x86
implementation was using such an internal mechanism — and he
expressed concerns about how it would ensure the Architectural
guarantees.
As I wrote before, any simple LL/SC operation that could be
replaced by the compiler with a simple atomic instruction could
be recognized by hardware as a special case for optimization and
made to behave as if it was a single atomic instruction.
[snip]
We'll have to agree to disagree. I consider the lack of scalability
of LL/SC to be a fatal defect.
I believe the lack of scalability is an implementation choice
and allowing that poor scalability is an Architectural choice.
I.e., this is not about the instruction interface so much as
about quality of implementation (and Architectural or "profile"
guarantees).
Maybe practically one cannot trust processor developers (and
those defining the guarantees) to do the extra work to close
that gap. Maybe advertising atomic instructions is more
effective than advertising well-implemented LL/SC. (I am
sufficiently discouraged about human nature and current human
society to believe that "well-implemented LL/SC" is a
cloud-cuckoo-land concept.)
I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.
Paul Clayton <[email protected]> writes:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).
I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).
It would be limiting to tie LL/SC to cache lines.
Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.
Paul Clayton <[email protected]> posted:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
It took me an entire year (2000+ hours) to create ASF after knowing
how LL/SC works. The "here is the basic idea" was only a couple of
days--the rest of the time was making "here are a small number of
cache lines", "make them all available at the same time", in such
a way that "you can make all updates appear system wide in a single
instance" or "make them appear to have never been modified" with
semantics that work EVEN IF YOU DO NOT HAVE A CACHE in the CPU.
Then there is multiple-LL memory order semantics,
detection of interference,
a system arbiter when interference is heavy,
and what to do when interference prevents completion.
LL/SC is easy, compared to making multiple-LL and multiple-SC
work.
Paul Clayton <[email protected]> writes:
[snip]
I wish that at least we could agree that simple LL/SC operations
could _theoretically_ provide the same guarantees and
optimization as simple atomic instructions.
Functionality guarantees, yes. Performance has to suffer,
unless the hardware can analyze all the instructions between
the LL/SC and abstract them into a single bus operation; which
I don't see as feasible.
If you can figure out how to implement LL/SC optimally
to CXL remote memory for the same set of atomic operations
provided by PCI express, I'd be interested in the result.
On 5/21/26 4:17 PM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).
Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.
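The granule trade-off can be illustrated with a toy, single-threaded model (a sketch, not any real architecture's monitor). `ReservationModel` and its member names are hypothetical. It models only stores by other agents; whether a remote load can clear a reservation is implementation specific and is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy model of one hardware thread's reservation. ASSUMPTION: the
// granule size is a non-zero number of bytes chosen by the caller.
struct ReservationModel {
    std::uintptr_t granule_bytes;
    std::optional<std::uintptr_t> reserved;  // granule index held after LL

    std::uintptr_t granule_of(std::uintptr_t addr) const {
        return addr / granule_bytes;
    }

    // LL: remember which granule the address falls in.
    void load_linked(std::uintptr_t addr) { reserved = granule_of(addr); }

    // A store by another agent clears the reservation if it lands
    // anywhere in the reserved granule, even on a different word.
    void remote_store(std::uintptr_t addr) {
        if (reserved && *reserved == granule_of(addr)) reserved.reset();
    }

    // SC: succeeds only if the reservation survived; it is consumed
    // either way.
    bool store_conditional(std::uintptr_t addr) {
        const bool ok = reserved && *reserved == granule_of(addr);
        reserved.reset();
        return ok;
    }
};
```

With a 64-byte granule, a remote store to the neighbouring word at offset 8 makes the SC fail; with an 8-byte granule the same store leaves the reservation intact. A granule covering the whole address space would make every remote store a conflict.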
On 5/24/2026 2:24 PM, Paul Clayton wrote:
On 5/21/26 4:17 PM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).
Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.
With a large granule, we then need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?
FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
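A minimal sketch of that hashed-lock idea, assuming a fixed table of mutexes indexed by the target word's address. This is not the code from the linked post; the namespace, the table size `kLockCount`, and the function names are all hypothetical. It is only correct if every access to the target word goes through the same table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

namespace hashed_cas {

constexpr std::size_t kLockCount = 64;  // ASSUMPTION: power-of-two table size

inline std::array<std::mutex, kLockCount>& lock_table() {
    static std::array<std::mutex, kLockCount> table;
    return table;
}

// 8-byte-aligned words have three zero low address bits; shift them out
// so that every slot of the table is reachable.
inline std::mutex& lock_for(const void* addr) {
    const auto bits = reinterpret_cast<std::uintptr_t>(addr);
    return lock_table()[(bits >> 3) & (kLockCount - 1)];
}

// Emulated compare-and-swap: on failure `expected` receives the value
// that was actually found, mirroring the usual CAS convention.
inline bool emulated_cas(std::uint64_t* target, std::uint64_t& expected,
                         std::uint64_t desired) {
    std::lock_guard<std::mutex> guard(lock_for(target));
    if (*target == expected) {
        *target = desired;
        return true;
    }
    expected = *target;
    return false;
}

}  // namespace hashed_cas
```

Unrelated words that hash to the same slot serialize against each other, so the table size trades memory for spurious contention.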
"Chris M. Thomasson" <[email protected]> posted:
On 5/24/2026 2:24 PM, Paul Clayton wrote:
On 5/21/26 4:17 PM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).
Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.
With a large granule, we then need to worry about a single load, say via
false sharing or something... Well, can that cause the SC to fail?
Does this "LL/SC and other core instructions synchronization means" not
fall from "desirable" when one has a complete set of to-memory() atomic
actions {add, sub, and, or, xor, xchg, cmp, cas} which avoid all the
quadratic and cubic interconnect traffic in the system (which is the
real source of slow synchronization) ??!!?? while being guaranteed to
work without interference and usable for both cacheable and
unCacheable memory accesses ??!!??
FWIW, if a "slow path" is hit, wrt RMW based CAS, we can emulate them
using a hashed lock where address of a target word is used to index into
an array. Something akin to:
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
On 5/21/26 4:17 PM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
On 5/14/26 3:49 AM, Chris M. Thomasson wrote:
[snip]
Wrt LL/SC, how large is the reservation granule? PPC has some
insight...
Usually the reservation granule is the cache block in order to
exploit existing cache coherence mechanisms.
ARM architectures allow (but don't encourage) a reservation
granule that covers the entire address space (e.g. see the
ARMv7 ARM).
Any larger granule assures correctness but hinders performance.
A global lock works but does not allow much parallelism.
The less specifically the size is defined, the less performance-
portable software becomes. One can address this with something
like RISC-V profiles, in which sizes can be more specific and
software that cares will specify a target profile rather than an
Architecture (version).
Since granule size can influence what code is most efficient,
even recompiling is not an excellent option. So for a class of
applications, having a single target seems to make sense.
Being able to test software on a development machine can also be
useful, so desired performance compatibility might be broader
than an application type.
I feel there is relatively little to prevent LL/SC semantics
from being extended to support multiple cache blocks (or, for
small LL/SC code bodies, single words for conflicts with other
atomic operations — normal loads and stores might still use
cache block granularity to limit complexity and/or network
overhead).
It would be limiting to tie LL/SC to cache lines.
It is not tying the operation to cache lines but to cache
line granules in terms of external interference monitoring
(and, in the case of a modest extension beyond traditional
LL/SC, the scope of the read/write set).
Atomics are independent of the cache, and can be used with
both cacheable and non-cacheable memory as well as
CXL and PCI Express devices.
I am not certain that LL/SC (or an extended form of such)
could not be used with "I/O" addresses. This merely requires
the equivalent of one cache line "cache" (or the largest
guaranteed size of a transaction) and some form of
monitoring ("coherence") of such memory addresses.
In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.
For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action to
occur in program order?
My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)
On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
CAS failures, I have tested this in the past, will hit the bus lock
and still make forward progress... Sigh... A horrible LL/SC thing can
live lock!
LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.
In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.
A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)
Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have the
cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.
How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very few.
On 5/27/2026 2:08 PM, Chris M. Thomasson wrote:
[...]
How many SC failures on a fetch-and-add are acceptable before you
conclude something's fundamentally broken? For me the answer is: very
few.
A LOCK XADD can be used for wait free algos, a LOCK XADD emulated with
LL/SC cannot... ?
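One way to make the "how many failures are acceptable" question explicit in software is to give the retry loop a budget. This is a sketch only: `add_with_budget` and its parameters are hypothetical names, and what the caller does when the budget runs out (take a lock, back off, report) is left open.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Fetch-and-add as a retry loop with a failure budget. Returns the old
// value on success, or std::nullopt once more than `max_failures`
// attempts have failed. `failures` reports how many attempts failed.
inline std::optional<std::uint64_t> add_with_budget(
        std::atomic<std::uint64_t>& word, std::uint64_t delta,
        unsigned max_failures, unsigned& failures) {
    failures = 0;
    std::uint64_t seen = word.load(std::memory_order_relaxed);
    for (;;) {
        // A failed exchange refreshes `seen`, so the next attempt uses
        // the current value.
        if (word.compare_exchange_weak(seen, seen + delta,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
            return seen;
        }
        if (++failures > max_failures) {
            return std::nullopt;  // caller must escalate
        }
    }
}
```

A single `fetch_add` needs no such budget at the source level, which is the point of the wait-free remark: the bound on steps comes from the operation itself rather than from a loop the program has to police.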
On 5/20/2026 4:47 PM, Paul Clayton wrote:
On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!
LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.
In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.
Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.
A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)
A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...
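Padding and aligning to the granule might look like the sketch below. The 64-byte figure in `kAssumedGranule` is an assumption for illustration, not a documented value for any particular part, which is exactly the number the surrounding posts say is hard to pin down. The type names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: reservation granule of 64 bytes. Replace with the value
// documented (or measured) for the actual target.
constexpr std::size_t kAssumedGranule = 64;

// One counter per granule: alignas rounds both the alignment and the
// size of the struct up to the granule size.
struct alignas(kAssumedGranule) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kAssumedGranule,
              "each counter should occupy exactly one assumed granule");
static_assert(alignof(PaddedCounter) == kAssumedGranule,
              "each counter should start on an assumed granule boundary");

// Adjacent slots can then be updated by different threads without
// sharing a granule, under the assumption above.
struct PerThreadCounters {
    PaddedCounter slot[4];
};
```

If the real granule is larger than the assumed one, neighbouring slots share a granule again and the false sharing returns without any visible change in the code.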
Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o
Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have the
cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.
How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.
Paul Clayton <[email protected]> writes:
[snip]
In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.
If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?
For other operations, I am not certain what semantics make
sense. If a read at one address changes the behavior of another
access, does "atomic" behavior mean that the later in program
order access happens before the I/O agent changes the access
behavior or does it mean that the atomic action blocks "ordinary
software agents" but lets side effects caused by the action to
occur in program order?
Atomics ensure that the access is atomic with respect to
all other accessors - ensuring that the other accessors
will not see inconsistent data.
Atomics can be used as a basis (e.g. atomic test&set) to
guard a critical section, but they're also useful for
adjusting shared counters et alia.
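Both uses can be sketched in a few lines of C++ (an illustration, not anyone's production code). `TestAndSetLock`, `GuardedPair`, `bump_both`, and `bump_counter` are hypothetical names; the lock is a plain spin on `std::atomic_flag` with no back-off.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Spinlock built on atomic test-and-set.
class TestAndSetLock {
public:
    void lock() {
        // test_and_set returns the previous state: spin while it was
        // already set by another holder.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Critical section: two plain fields that must change together.
struct GuardedPair {
    TestAndSetLock lock;
    std::uint64_t a = 0;
    std::uint64_t b = 0;
};

inline void bump_both(GuardedPair& pair) {
    pair.lock.lock();
    ++pair.a;
    ++pair.b;
    pair.lock.unlock();
}

// Shared counter: no lock needed, one atomic add per update.
inline std::uint64_t bump_counter(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}
```

The lock protects an arbitrary amount of state at the cost of serializing holders; the counter needs no lock because the whole update is one atomic operation.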
My perception is that PCI-E atomics are not meant for
non-idempotent storage. (I do not know how ARM atomic
instructions handle such cases.)
See above.
On 5/27/26 10:25 AM, Scott Lurndal wrote:
Paul Clayton <[email protected]> writes:
[snip]
In the case of a simple operation, as has been stated before,
the LL/SC sequence can be converted to the equivalent of an
atomic instruction.
If true in the general case (and I'm not sure I see how it
can be), why bother to add the hardware to do so when
atomics are generally superior, scalable, simpler to implement and
higher performance?
A more generic interface has some advantages.
I already mentioned that old software that was developed when
there was not an atomic ["expensive" operation] instruction
could benefit from idiom recognition on new hardware. (An
alternative to that would be patching or recompiling the
software. While I prefer a more abstract software distribution
format for its ability to avoid having to move things to
Architecture and even potentially perform microarchitectural
optimizations at non-instruction granularity, such seems
unlikely to be common any time soon.)
Even with atomic instructions, the Architecture generally does
not provide guarantees about scalability. I doubt any
implementation would stop-the-world to perform an atomic
operation (because the performance penalty would be quite
noticeable), but I can easily imagine an implementation
waiting until the atomic operation is not speculative before
starting it.
On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
On 5/20/2026 4:47 PM, Paul Clayton wrote:
On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!
LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.
In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.
Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations. IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.
A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)
A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...
I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.
(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)
I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.
Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o
Ugh!
Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.
Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have the
cognitive load for software developers be acceptable. That
such guarantees seem to be very rare is sad.
How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.
Again, I think this is more concerned with "quality of
implementation" (and Architectural guarantees about such) than
with the interface at an instruction level.
But note: XADD [...] never causes more than necessary bus traffic
and as an atomic event, never fails, never needs retry, ...
On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]
But note: XADD [...] never causes more than necessary bus traffic
I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).
and as an atomic event, never fails, never needs retry, ...
I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees, even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.
Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.
Constrained transactions had these restrictions (from https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.
I used a timer--to the same ends.
| - All instructions within the transaction must be within 256
| contiguous bytes of storage.
I allow calls to subroutines in the event.
| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).
Loops are OK as long as the timer does not go off.
| - All SS and SSE-format instructions may not be used.
Agreed.
| - Additional general instructions may not be used.
I see no reason to limit general calculations and memory access.
| - The transaction's storage operands may not access more than
| four octowords.
8 cache lines participate, an unbounded number of cache lines
| - The transaction may not access storage operands in any 4 K-
| byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.
interesting.
| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.
Any normal memory references to the participating lines.
I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???
I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)
I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.
There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.
With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.
I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses may be very wrong about what is
practical to guarantee as atomic.
Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.
Paul Clayton <[email protected]> posted:
On 5/27/26 5:08 PM, Chris M. Thomasson wrote:
On 5/20/2026 4:47 PM, Paul Clayton wrote:
On 5/14/26 3:58 AM, Chris M. Thomasson wrote:
CAS failures, I have tested this in the past, will hit the
bus lock and still make forward progress... Sigh... A
horrible LL/SC thing can live lock!
LL/SC live lock is implementation dependent. One could
Architecturally guarantee forward progress for the kind of cases
where CAS would be an alternative.
In my opinion, this is not so much a CAS vs. LL/SC issue as a
quality of implementation issue.
Well, making a LOCK CAS, or say LOCK XADD, has certain inherent
guarantees. Using LL/SC to emulate them is a different story.
Academic LL/SC: I can agree with this statement. But neither ASF nor
ESM has problems making stronger guarantees--and I did this over
{7 ASF, 8 ESM} cache lines, not 1 single memory location. These also
impose limitations on instruction order, and SW has to understand
several non-von-Neumann properties of the ATOMIC event.
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
That standard academic stuff cannot, does not mean it absolutely
cannot be done.
IBM's constrained
transactions guaranteed success of a transaction if it met
certain criteria. A single-instruction LL/SC body could be
Architecturally guaranteed to perform not only successfully but
with some performance characteristics.
A guarantee of forward progress is not very useful if the
progress is glacially (or cosmologically) slow. ("We guarantee
that the operation will complete before the heat death of the
universe"☺)
A _guarantee_ of forward progress is ALWAYS important? Sorry for
shouting. Shit. Knowing the size of the reservation granule is
hyper important to help the software pad and align to remove any
false sharing on said granule. No? But...
I disagree. A guarantee that has a time scale beyond human
civilization much less the lifetime of the hardware seems to
have extremely little use. It may be reasonable to assume
reasonable timescales for such guarantees, but a simple
guarantee of eventual completion (if the system is kept
operating) might be given if the profit motive seems sufficient.
(I am not certain if even x86 XLOCK operations are absolutely
guaranteed to complete in the presence of context switches. A
hardware thread might always be interrupted while it is
performing the operation and if the hardware does not delay
interrupt handling until after the operation completes, then the
operation may never complete. This may be so extraordinarily
improbable that an undetected error in ECC-protected memory
might be more likely, in which case it is not really important.)
I think one really wants the time scale explicitly declared as
well as information about the range of latency and causes. Even
5ms latency can seem like forever.
Here's where the deeper problem can rear its ugly head... Vendors
often don't document it? Or they document it inconsistently
across revisions? So even if you do everything right in
principle, you're tuning against a number you had to dig out of
a forum post or reverse engineer yourself. Scary! ;^o
Ugh!
Architecting a lot of such factors might help with documentation
as Architecture is more stable than microarchitecture, but I do
not think typical companies have the incentives for excellence
in documentation. If the only consequence of mistakes in
Architectural documentation is a few software developers
grumbling, keeping even such stable documentation consistent and
correct (and abiding by the old/existing Architectural contract)
seems unlikely to seem important. In fact, if the inability to
optimize forces people to buy more (or more expensive) hardware,
poor documentation can mean higher profits.
It took me more than 35 years to learn how to write µArchitecture
documents such that a malevolent engineer could not misunderstand
what was written and specified. Try it, it is not easy. It is not
something that can be taught, but it is something that diligence
and perseverance can deliver.
Of course, the temptation toward "good enough" (not so bad
that one will lose too many customers) is a problem. I would
expect documented guarantees of sufficient generality to have
an acceptable cognitive load for software developers. That
such guarantees seem to be very rare is sad.
How many SC failures on a fetch-and-add are acceptable before
you conclude something's fundamentally broken? For me the answer
is: very few.
How many SC failures are acceptable if there are 1024 cores all
going after the same lock ??
Again, I think this is more concerned with "quality of
implementation" (and Architectural guarantees about such) than
with the interface at an instruction level.
Paul Clayton <[email protected]> posted:
On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]
But note: XADD [...] never causes more than necessary bus traffic
I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).
The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.
The interconnect is going to route the package towards either a
cache having write permission or a control register.
The cache or control register will perform the packaged calculation
and optionally send back the previous value.
The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either
direction.
and as an atomic event, never fails, never needs retry, ...
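A rough software-level illustration of the contrast being argued here, and not a model of the interconnect: the C++ sketch below puts a single fetch-and-add next to an optimistic load/compute/publish loop. It assumes std::atomic; on x86-64, compilers typically lower fetch_add to LOCK XADD, while the compare-exchange loop has the retry structure of a CAS or LL/SC emulation. The function names and the retries counter are made up for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Single-operation fetch-and-add: software sees no failure path and no retry.
inline std::uint64_t xadd_style(std::atomic<std::uint64_t>& counter,
                                std::uint64_t delta) {
    return counter.fetch_add(delta, std::memory_order_relaxed);
}

// Optimistic emulation: load, compute, then try to publish the result.
// The loop retries whenever the publish fails (another thread changed the
// value, or the weak CAS failed spuriously on an LL/SC machine).
// `retries` is a hypothetical counter of failed attempts.
inline std::uint64_t cas_loop_add(std::atomic<std::uint64_t>& counter,
                                  std::uint64_t delta,
                                  std::uint64_t& retries) {
    std::uint64_t old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + delta,
                                          std::memory_order_relaxed)) {
        ++retries;  // the failed CAS has refreshed `old` with the current value
    }
    return old;  // previous value, as with fetch_add
}
```

Both functions return the previous value; only the second has a software-visible retry count, which is the number one would watch when asking how many failures are acceptable under contention.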
I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,
If so, you will be surprised when you implement one.
even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.
Where it becomes cubically harder.
Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.
SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.
Constrained transactions had these restrictions (from
https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.
I used a timer--to the same ends.
| - All instructions within the transaction must be within 256
| contiguous bytes of storage.
I allow calls to subroutines in the event.
| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).
Loops are OK as long as the timer does not go off.
| - All SS and SSE-format instructions may not be used.
Agreed.
| - Additional general instructions may not be used.
I see no reason to limit general calculations and memory access.
| - The transaction's storage operands may not access more than
| four octowords.
8 cache lines participate, an unbounded number of cache lines
can be accessed as long as the number of participants is no larger than 8.
| - The transaction may not access storage operands in any
| 4 K-byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.
Interesting.
| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.
Any normal memory references to the participating lines.
I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???
I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)
If you take the necessary 6 months to slog through all the issues,
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).
I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.
If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.
There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.
The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.
My 66000 includes priority in memory transactions, and this enables
the cache with write permission to decide whether to allow the request
or to fail it (when the request is at equal or lower priority), thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.
At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.
With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.
It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory coherence protocols are also insufficient.
I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses may be very wrong about what is
practical to guarantee as atomic.
See Lamport...
Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.
...
On 6/2/2026 12:36 PM, MitchAlsup wrote:
Paul Clayton <[email protected]> posted:
On 6/1/26 9:27 PM, MitchAlsup wrote:
[snip]
But note: XADD [...] never causes more than necessary bus traffic
I am skeptical that this is Architecturally guaranteed. It may
fall out of any even semi-sane implementation, in which case
programmers might be willing to take it as guaranteed. Yet I
suspect "sanity" may not be reliable with changing tradeoffs
(including whether protecting a company's reputation has value).
The core is going to package this instruction up and ship it
across the interconnect as a fire-and-forget transaction.
The interconnect is going to route the package towards either a
cache having write permission or a control register.
The cache or control register will perform the packaged calculation
and optionally send back the previous value.
The core receives the optional previous value and the memory-atomic
is complete:: 2 interconnect messages, both smaller than a cache line,
no cache lines are moved, and the calculation cannot fail. The only
failure mode is if the interconnect message fails ECC check in either
direction.
and as an atomic event, never fails, never needs retry, ...
I believe an optimistic concurrency interface (LL/SC, ASF, ESM,
etc.) could provide such guarantees,
If so, you will be surprised when you implement one.
even extending to multiple
contiguous instructions operating on data within an aligned
64-byte region.
Where it becomes cubically harder.
Interestingly, it seems that IBM's z17 is the last
implementation to support constrained transactions. I do wonder
why this feature has been removed from the Architecture.
SW TM wants the TM model to support an unbounded number of memory
elements in the single transaction. HW does not do unbounded.
Constrained transactions had these restrictions (from
https://www.ibm.com/docs/en/zos/3.1.0?topic=execution-constrained-transactions):
| - The transaction executes no more than 32 instructions.
I used a timer--to the same ends.
| - All instructions within the transaction must be within 256
| contiguous bytes of storage.
I allow calls to subroutines in the event.
| - The only branches you may use are relative branches that
| branch forward (so there can be no loops).
Loops are OK as long as the timer does not go off.
| - All SS and SSE-format instructions may not be used.
Agreed.
| - Additional general instructions may not be used.
I see no reason to limit general calculations and memory access.
| - The transaction's storage operands may not access more than
| four octowords.
8 cache lines participate, an unbounded number of cache lines
can be accessed as long as the number of participants is no larger than 8.
| - The transaction may not access storage operands in any
| 4 K-byte blocks that contain the 256 bytes of storage beginning
| with the TBEGINC instruction.
Interesting.
| - Operand references must be within a single doubleword,
| except for some of the "multiple" instructions for which the
| limitation is a single octoword.
Any normal memory references to the participating lines.
I think I read that the first implementation made an optimistic
attempt and later — I do not remember if multiple optimistic
attempts were made — a hardware lock was used. Perhaps four
addresses cause too much of a slowdown when there is conflict???
I believe that guaranteeing completion would be substantially
easier with only one aligned 64-byte region. (As I think I
wrote before, adding a single "word" exportable atomic operation
in a different "cache block" _might_ be practical to implement
though I did not have an idea for how software would express such.
I may be wrong that appending such an exportable operation would
not make ensuring completion significantly more difficult.)
If you take the necessary 6 months to slog through all the issues,
you can find solutions for the disjoint participants to be at
least as large as the outstanding Miss Buffer size (or MB-1).
I think such guaranteed atomic sequences would require a
distinct instruction not only to allow what IBM did (making such
an illegal/faulting instruction) but also to fault when the
instruction is misused since no fallback path is provided.
If you do it right, your architecture sets up failure paths,
so that if failure happens, IP reverts to the failure point
without executing a branch instruction. I have an instruction
that samples 'interference' and changes the failure point as
a necessary addition. Any interrupt or exception transfers
control to failure point before performing exception control
transfer.
There also seem to be other operations that would not (I think)
be exceptionally difficult to guarantee. E.g., swapping cache
blocks might not be much more difficult to guarantee than quick
operations within a single cache block, though I do not know
how useful such an unconditional swap would be. Atomic cache
block copy would seem to be easier (it is similar to a block
zeroing instruction except that the value is taken from a block
that is not writeable by other agents being in exclusive or
shared state). Guaranteeing atomicity for a copy into a cache
block (where two contiguous cache blocks might be in the read
set and the write is only to part of a cache block) seems a
little more complicated.
The thing that makes this so difficult is that most µArchitectures
cannot guarantee that 2 cache lines are ever simultaneously present
in the cache. ASF and ESM have means to do this which greatly
strengthens the guarantee of forward progress.
My 66000 includes priority in memory transactions, and this enables
the cache with write permission to decide whether to allow the request
or to fail it (when the request is at equal or lower priority), thus
allowing the higher priority ATOMIC event to make forward progress
at the expense of the lower priority event.
At certain times the core may be in a position where it can finish
an event if the cache lines can be guaranteed. During this period,
a core can NaK a request so that the event is guaranteed to finish.
With conventional cache coherence, partial writes seem likely to
be complex. If masked cache block updates were possible as an
exportable atomic operation, it might be practical to lock (NAK-
guard) a limited read set and push the update to the owner. I do
not know if such an update independent of previous values in the
written cache block would be useful.
It is much worse than that in practice. The interconnect protocol and
the cache coherence model HAVE to HAVE ATOMIC event forward progress
fully integrated. MESI and MOESI are insufficient here; most directory
coherence protocols are also insufficient.
I am certainly not comfortable thinking about the visibility/
ordering constraints, so my guesses may be very wrong about what is
practical to guarantee as atomic.
See Lamport...
Even if an operation can practically be guaranteed, it may not
be worthwhile to provide an interface that allows requesting
such a guaranteed atomic operation.
...
Well, we can do something... we know that lock cmpxchg8b on a 32 bit
system can handle two adjacent cache lines. So, we can try to hold more
than that, but! it's not ideal. For instance my multex can do it and
emulate it. Read all
https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.
Paul Clayton <[email protected]> writes:
I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.
Let's see:
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;
: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;
On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):
!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic
On a Xeon E-2388G (Rocket Lake):
!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic
On 6/3/2026 11:19 AM, Anton Ertl wrote:
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;
: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;
On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):
!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic
On a Xeon E-2388G (Rocket Lake):
!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic
Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.
It's up to the programmer to make sure that is amortized, distributed
in clever ways. For instance, why use a single atomic counter, vs say
using a per thread counter and summing them when we need to observe
the actual count?
"Chris M. Thomasson" <[email protected]> writes:
On 6/3/2026 11:19 AM, Anton Ertl wrote:
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;
: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;
On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):
!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic
On a Xeon E-2388G (Rocket Lake):
!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic
Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.
It's two locations in these benchmarks: X and Y.
It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?
These benchmarks use per-thread storage: They are single-threaded.
On 6/3/2026 1:53 PM, Anton Ertl wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 6/3/2026 11:19 AM, Anton Ertl wrote:
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
: bench-+!@
1 5000000 0 do x +!@ y +!@ loop drop ;
: bench-atomic+!@
1 5000000 0 do x atomic+!@ y atomic+!@ loop drop ;
On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):
!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic
On a Xeon E-2388G (Rocket Lake):
!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic
Hammering a single location is going to be bad for LL/SC or LOCK RMW,
regardless of the ins and outs of LL/SC vs LOCK RMW.
It's two locations in these benchmarks: X and Y.
It's up to the
programmer to make sure that is amortized, distributed in clever ways.
For instance, why use a single atomic counter, vs say using a per thread
counter and summing them when we need to observe the actual count?
These benchmarks use per-thread storage: They are single-threaded.
Humm... I missed that. Anyway, you need to test them multithreaded...
Say our counters are per thread, so an increment adds to its per-thread
counter instead of using a LOCK RMW. Then when the counter needs to be
sampled we can start summing up the per-thread counts...
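A minimal C++ sketch of that per-thread counter idea, under some loud assumptions: a 64-byte cache line (kLine), one slot per thread with the caller passing its own slot index, and exactly one writer per slot. ShardedCounter and its members are hypothetical names, not anything from the posts above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLine = 64;  // assumed cache-line size

// One counter per thread, each on its own line to avoid false sharing.
struct alignas(kLine) Shard {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards) : shards_(shards) {}

    // Called only by the thread that owns `shard`. With a single writer, a
    // relaxed load followed by a relaxed store is enough; no LOCKed RMW.
    void add(std::size_t shard, std::uint64_t delta) {
        assert(shard < shards_.size());
        auto& v = shards_[shard].value;
        v.store(v.load(std::memory_order_relaxed) + delta,
                std::memory_order_relaxed);
    }

    // Sum all shards. While writers are running this is only an approximate
    // snapshot; it is exact once the writers have stopped.
    std::uint64_t sample() const {
        std::uint64_t sum = 0;
        for (const auto& s : shards_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<Shard> shards_;
};
```

Each worker would call add() with its own index and any thread may call sample(). The trade is a cheap increment against a read that costs one load per shard and is not a single-instant value.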
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
On 2026-Jun-03 14:19, Anton Ertl wrote:...
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using Motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.
EricP <[email protected]> writes:
On 2026-Jun-03 14:19, Anton Ertl wrote:...
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.
The code for "x !@" is:
mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)
while the code for "x atomic!@" is:
mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13
As you can see, there is no XCHG in the !@ code.
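For readers who do not read Forth, here is a rough C++ analogue of the two words, offered as an assumption about their semantics based on the listings above: the plain version is a separate load and store (like the MOV pair), and the atomic version uses std::atomic exchange, which x86-64 compilers typically turn into XCHG with a memory operand. plain_exchange and locked_exchange are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Analogue of "!@": load the old value, then store the new one.
// Two separate memory accesses; another thread could write in between.
inline std::int64_t plain_exchange(std::int64_t& cell, std::int64_t desired) {
    std::int64_t old = cell;
    cell = desired;
    return old;
}

// Analogue of "atomic!@": one indivisible read-modify-write.
inline std::int64_t locked_exchange(std::atomic<std::int64_t>& cell,
                                    std::int64_t desired) {
    return cell.exchange(desired);
}
```

Single-threaded, the two return the same values; the difference only shows up in cost and in what other threads can observe.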
On 6/4/2026 2:04 PM, Anton Ertl wrote:
EricP <[email protected]> writes:
On 2026-Jun-03 14:19, Anton Ertl wrote:...
variable x 1 x !
variable y -1 y !
: bench-!@
1 5000000 0 do x !@ y !@ loop drop ;
: bench-atomic!@
1 5000000 0 do x atomic!@ y atomic!@ loop drop ;
On x86/x64 the exchange instruction XCHG has an implied LOCK prefix
whether it is specified or not. In your example both are atomic.
The code for "x !@" is:
mov 0x8(%rbx),%r15
mov %r13,%rax
mov (%r15),%r13
mov %rax,(%r15)
while the code for "x atomic!@" is:
mov %r13,(%r10)
sub $0x8,%r10
mov 0x8(%rbx),%r13
mov 0x8(%r10),%rax
add $0x8,%r10
xchg %rax,0x0(%r13)
mov %rax,%r13
As you can see, there is no XCHG in the !@ code.
How is your data organized? Show me the structure?
Paul Clayton <[email protected]> writes:
I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.
On a Ryzen 8700G (Zen4) each execution of a !@ (exchange) or +!@
(fetch-and-add) costs the following numbers of cycles (including
overhead):
!@ +!@
7.5 7.3 not atomic
14.2 13.2 atomic
On a Xeon E-2388G (Rocket Lake):
!@ +!@
8.5 7.1 not atomic
25.8 26.6 atomic
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>
I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC.
...These benchmarks use per-thread storage: They are single-threaded.
They might be allocated in the same cache line.
On 6/4/2026 2:04 PM, Anton Ertl wrote:...
EricP <[email protected]> writes:
On 2026-Jun-03 14:19, Anton Ertl wrote:
variable x 1 x !
variable y -1 y !
How is your data organized? Show me the structure?
// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};
// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};
Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
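A minimal C11 sketch of "pad and align", under stated assumptions: the line size is taken to be 64 bytes (the real L2 line size is machine-specific), and the type and helper names are hypothetical, not taken from any code in this thread.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* One data word that owns a whole cache line: aligned to the line
   start and padded out to the line size. */
struct padded_word {
    alignas(CACHE_LINE) uintptr_t m_data;
    char padding[CACHE_LINE - sizeof(uintptr_t)];
};

/* Compile-time checks that the type is both padded and aligned. */
static_assert(sizeof(struct padded_word) == CACHE_LINE, "padded to one line");
static_assert(alignof(struct padded_word) == CACHE_LINE, "aligned to a line");

static struct padded_word a, b;

/* Returns nonzero if two addresses fall into the same cache line. */
static int same_cache_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

Padding alone fixes only the size and alignment alone fixes only the start address; the two static assertions check both properties.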
On 6/4/2026 7:21 AM, Scott Lurndal wrote:
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we investigated
MIPS, SPARC and Pentium Pro. Our target was for a 64+ processor
SPP. After evaluation, we chose Pentium Pro to build the system
(using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were concerned
at the time about the scalability of LL/SC. SPARC never made it out
of the first evaluation round.
Why? I had a SunFire T2000 that, when programmed correctly, was pretty
fast for certain worksets and algorithms. RMO mode.
EricP <[email protected]> writes:
...These benchmarks use per-thread storage: They are single-threaded.
They might be allocated in the same cache line.
Given that they are accessed by the same thread, I don't expect that
to hurt, but I did separate the variables by at least 64 bytes in my
recent runs just in case.
"Chris M. Thomasson" <[email protected]> writes:
// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};
// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};
Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
Why would alignment to cache-line boundaries be necessary?
Anyway, let's see if it makes a difference.
A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).
B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).
C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).
D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.
E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).
F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).
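The byte offsets implied by variants C to F can be worked out mechanically. The following C sketch assumes 8-byte words and 64-byte cache lines and measures offsets from the start of a cache line; the struct and function names are illustrative only. Variants A and B are left out because their offsets depend on the size of the variable metadata and on where the dictionary pointer happens to be.

```c
#include <assert.h>
#include <stddef.h>

enum { CACHE_LINE = 64, WORD = 8 }; /* assumptions: 64-byte lines, 8-byte words */

/* Offset of the first data word from a cache-line start, and the
   padding between the end of the first word and the second word. */
struct layout {
    size_t first;
    size_t pad;
};

/* Byte offset of the second data word. */
static size_t second_offset(struct layout l)
{
    return l.first + WORD + l.pad;
}

/* Index of the cache line that a byte offset falls into. */
static size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

static const struct layout variant_c = { 0, 0 };  /* both words in one line */
static const struct layout variant_d = { 0, 56 }; /* second word line-aligned */
static const struct layout variant_e = { 0, 64 }; /* second word at offset 8 of the next line */
static const struct layout variant_f = { 8, 48 }; /* second word line-aligned */
```

With these assumptions the second word lands at byte 8 for C, 64 for D, 72 for E and 64 for F, so only C puts both words into the same cache line.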
And here are the results (on a Ryzen 8700G):
The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:
 !@  +!@  barr
2.4  2.4  1.8   A B C
2.4  2.4  1.9   D E
For the atomic/barrier variants the cycles are:
 !@  +!@  barr
9.3  8.3  7.2       A
9.2  8.3  7.1       B
9.2  8.3  8.5-11.2  C
9.3  8.3  9.1-11    D
9.1  8.3  7.3-11    E
The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.
Here's the code:
1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !
[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]
The part before the [else] is A, comment out "64 allot" for B.
The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.
[email protected] (Anton Ertl) writes:
Paul Clayton <[email protected]> writes:
I seem to recall reading that x86's LOCK instructions take
hundreds of cycles. While some of this is probably from stronger
memory ordering guarantees, I get the impression that the
operation itself is not aggressively optimized.
I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as
__atomic_thread_fence(__ATOMIC_SEQ_CST);
The barriers separate loads.
I have increased the loop count by a factor of 10, because I did not
subtract the startup overhead of Gforth; as a result, the startup
overhead is reduced from 3.3 cycles per execution of the relevant word
to 0.33 cycles.
I have also inserted 64 bytes between the variables, so that they are
in different cache lines. This should not make a difference, because
all accesses are in the same thread (i.e., no cache-ping-pong from
possible false sharing), but just in case.
What I did not do is to use several threads. The idea here is that
programmers will take measures that ensure that contention is rare,
but you still need to use atomic instructions and barriers to ensure
correctness. Ideally in this case the atomic instructions and
barriers have no extra cost, but in reality, they do have extra cost.
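To illustrate what "the barriers separate loads" means, here is a small C sketch of the two shapes being compared. It is not the Gforth benchmark itself: the function names are hypothetical, and only `__atomic_thread_fence(__ATOMIC_SEQ_CST)` is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Two plain loads with nothing in between (the no-barrier variant). */
static intptr_t loads_plain(const intptr_t *p, const intptr_t *q)
{
    intptr_t a = *p;
    intptr_t b = *q;
    return a + b;
}

/* The same two loads separated by a sequentially consistent fence
   (the barrier variant). */
static intptr_t loads_fenced(const intptr_t *p, const intptr_t *q)
{
    intptr_t a = *p;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    intptr_t b = *q;
    return a + b;
}
```

Both functions compute the same result in a single thread; any difference in cycle count is the cost of the fence itself.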
On 6/5/2026 3:20 AM, Anton Ertl wrote:
"Chris M. Thomasson" <[email protected]> writes:
// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};
// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};
Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
Why would alignment to cache-line boundaries be necessary?
Anyway, let's see if it makes a difference.
A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).
B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).
C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).
D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.
E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).
F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).
And here are the results (on a Ryzen 8700G):
The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:
!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E
For the atomic/barrier variants the cycles are:
!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E
The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that. The other columns show only small variations.
In any case the aligning and padding recommended by you is not
superior to the original code, which just uses two variables.
Well, it's mainly for false sharing in a multithreaded environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.
Single-threaded: avoid straddling cache line boundaries to prevent bus
locks on LOCK-prefixed instructions.
Multi-threaded: pad and align to prevent false sharing between
independently accessed variables.
For instance you don't want a mutex word to false share with say an
atomic counter that has nothing to do with the mutex. They need to be
padded and aligned...
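The straddling condition is easy to state in code. This is a sketch assuming 64-byte cache lines; the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE = 64 }; /* assumption: 64-byte cache lines */

/* Returns nonzero if an object of `size` bytes starting at `addr`
   crosses a cache-line boundary, i.e. its first and last byte lie in
   different lines. */
static int straddles_line(uintptr_t addr, size_t size)
{
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}
```

For example, an 8-byte word at offset 60 straddles a boundary, while any naturally aligned 8-byte word (address a multiple of 8) cannot, because 8 divides 64.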
Here's the code:
1 [if]
variable x 1 x !
64 allot \ make sure the variables are in different cache lines
variable y -1 y !
[else]
: cache-align here dup 64 naligned >align ;
cache-align
here 1 , cache-align here -1 , constant y constant x
[endif]
The part before the [else] is A, comment out "64 allot" for B.
The part after the [else] is D, delete the second CACHE-ALIGN for C,
and replace it with "64 allot" for E.
On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:
On 6/4/2026 7:21 AM, Scott Lurndal wrote:
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.
Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.
RMO mode?
I am pretty sure that T2000 had no RMO mode.
If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.
Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.
The SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnant of RMO in these processors is the block load and store
operations: they behave as RMO regardless of the processor's
global memory mode.
On 6/5/2026 7:02 AM, Michael S wrote:
On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:
On 6/4/2026 7:21 AM, Scott Lurndal wrote:
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.
Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.
RMO mode?
I am pretty sure that T2000 had no RMO mode.
If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.
Oh shit, I think you are right! I sometimes get my old SPARC boxes mixed
up.
Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.
It still needed an explicit membar for a store followed by a load to
another location, even in TSO.
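A minimal C sketch of that store-then-load case, assuming the GCC/Clang `__atomic` builtins; the function name is hypothetical. Under TSO a store can still be sitting in the store buffer while a later load from a different address completes, so the classic "set my flag, then read yours" handshake needs a full fence between the two. On SPARC V9 that fence is `membar #StoreLoad`.

```c
#include <assert.h>
#include <stdint.h>

/* Publish our flag, then look at the other side's flag. Without the
   fence, TSO allows the load to be satisfied before the store becomes
   visible to other processors. */
static intptr_t announce_then_check(intptr_t *mine, const intptr_t *theirs)
{
    __atomic_store_n(mine, 1, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* store->load barrier */
    return __atomic_load_n(theirs, __ATOMIC_RELAXED);
}
```

If two threads each run this with the roles of the flags swapped, the fence is what rules out the outcome where both read 0.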
Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?
Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.
The SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnant of RMO in these processors is the block load and store
operations: they behave as RMO regardless of the processor's
global memory mode.
Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?
On 6/5/2026 12:04 AM, Anton Ertl wrote:
[email protected] (Anton Ertl) writes:[...]
I have revised the benchmarks as follows: I have added a test of a
memory barrier, which is implemented in GNU C as
__atomic_thread_fence(__ATOMIC_SEQ_CST);
The barriers separate loads.
On x86, well, did it fall back to MFENCE? Or use a dummy LOCK RMW on a
per thread stack location?
Anyway, let's see if it makes a difference.[...]
A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).
B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).
C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).
D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.
E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).
And here are the results (on a Ryzen 8700G):
The cycles per execution of the relevant word for the
no-atomic/no-barrier variants are:
!@ +!@ barr
2.4 2.4 1.8 A B C
2.4 2.4 1.9 D E
For the atomic/barrier variants the cycles are:
!@ +!@ barr
9.3 8.3 7.2 A
9.2 8.3 7.1 B
9.2 8.3 8.5-11.2 C
9.3 8.3 9.1-11 D
9.1 8.3 7.3-11 E
The variations for the barrier column are small for A and B (in the
range 6.9-7.2), and quite a bit larger for C-E, and I have no
explanation for that.
On 6/5/2026 3:20 AM, Anton Ertl wrote:[...]
"Chris M. Thomasson" <[email protected]> writes:
// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};
// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};
Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
Why would alignment to cache-line boundaries be necessary?
...A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).
B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).
C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).
D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.
E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).
F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).
Well, it's mainly for false sharing in a multithreaded environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.
"Chris M. Thomasson" <[email protected]> posted:
On 6/5/2026 7:02 AM, Michael S wrote:
On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:
On 6/4/2026 7:21 AM, Scott Lurndal wrote:
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.
Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.
RMO mode?
I am pretty sure that T2000 had no RMO mode.
If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.
Oh shit, I think you are right! I sometimes get my old SPARC boxes
mixed up.
Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.
It still needed an explicit membar for a store followed by a load to
another location, even in TSO.
Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?
Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.
The SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnant of RMO in these processors is the block load and store
operations: they behave as RMO regardless of the processor's
global memory mode.
Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?
SPARC used nullification in delay slots.
"Chris M. Thomasson" <[email protected]> writes:
On 6/5/2026 3:20 AM, Anton Ertl wrote:[...]
"Chris M. Thomasson" <[email protected]> writes:
// padded to a l2 cache line
struct A
{
unsigned word m_data;
char padding[...];
};
// padded to a l2 cache line
struct B
{
unsigned word m_data;
char padding[...];
};
Where A and B are both aligned up to a l2 cache line boundary? We need
to pad _and_ align...
Why would alignment to cache-line boundaries be necessary?
...A) Word-aligned variable, 64 byte padding, another word-aligned
variable (what I measured and posted today). A variable takes space
not just for the data (one word), but also for the metadata (and the
metadata is adjacent to the data).
B) Word-aligned variables, no padding, word-aligned variable, with the
two data words maybe in the same cache line, maybe not (measured
yesterday).
C) Cache-line-aligned word, no padding, another cache-line-aligned
word (i.e., both words in the same cache line).
D) Cache-line-aligned word, (56 bytes of) padding, another
cache-line-aligned word.
E) Cache-line-aligned word, 64 bytes padding, another word (i.e., the
second word is aligned like in C).
F) Word at offset 8 from a cache-line start, 48 bytes padding, another
word (cache-line-aligned).
Well, it's mainly for false sharing in a multithreaded environment. But
it does matter a bit. If your variables straddle a cache line then it
will trigger a bus lock.
All of the data placement variants use word-aligned words and thus do
not straddle cache lines. But your claim was that one should use only
the first word in a cache line. Avoiding false sharing is important,
if there is any sharing, but that's not the case for this benchmark.
On 6/5/2026 6:44 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/5/2026 7:02 AM, Michael S wrote:
On Thu, 4 Jun 2026 18:28:43 -0700
"Chris M. Thomasson" <[email protected]> wrote:
On 6/4/2026 7:21 AM, Scott Lurndal wrote:
Andy Valencia <[email protected]> writes:
I do not think it is impossible for an architecture to make
guarantees about LL/SC operations.
I was at Sequent when we were really serious about moving off Intel
onto MIPS. We looked at LL/SC really, really hard. Lock traces
from current systems, SW simulations, down to gate-level
simulations.
We ended up being sufficiently confident (as in, bet the program,
by implication bet the company) that it would work as efficiently
as our current Intel atomics at up to 8-way 64-bit MIPS CPU's. And
that it was very likely to scale without undue incremental design
work to ~32 CPU's.
I was at Unisys in that same timeframe; we had planned on building
the SPP (scalable parallel processor aka OPUS) using motorola 88110
CPUs, until Apple went PPC and Moto canceled 88110. So we
investigated MIPS, SPARC and Pentium Pro. Our target was for a 64+
processor SPP. After evaluation, we chose Pentium Pro to build the
system (using the Intel Paragon backplane).
I don't recall the details of the MIPS evaluation, but we were
concerned at the time about the scalability of LL/SC. SPARC never
made it out of the first evaluation round.
Why? I had a SunFire T2000 that, when programmed correctly, was
pretty fast for certain worksets and algorithms. RMO mode.
RMO mode?
I am pretty sure that T2000 had no RMO mode.
If I am not mistaken, the only Sun SPARC CPUs that had RMO in hardware
were UltraSPARC and UltraSPARC II.
Oh shit, I think you are right! I sometimes get my old SPARC boxes
mixed up.
Iirc, UltraSPARC T1 was a full SPARC V9 implementation, and SPARC V9
defines three memory models: TSO, PSO, and RMO.
It still needed an explicit membar for a store followed by a load to
another location, even in TSO.
Actually, I forgot how I got some SPARCs into RMO mode. PSTATE?
Starting from UltraSPARC III Cu, all Sun SPARC processors are documented
to be TSO-only. The one processor for which I didn't find a definite
statement is the original UltraSPARC III (Cheetah), but I would be very
surprised if it is not the same as UltraSPARC III Cu.
The SPARC-T line (originally named Niagara) was TSO-only from the very
start.
The only remnant of RMO in these processors is the block load and store
operations: they behave as RMO regardless of the processor's
global memory mode.
Remember that old thing in one of the SPARC docs that explicitly
mentioned to NEVER put a MEMBAR instruction in the branch delay slot?
SPARC used nullification in delay slots.
Iirc, and I might be wrong here, a MEMBAR can force processor
serialization or stall the pipeline until the store buffers drain, so
executing it right when the processor is updating the PC and nPC for a
branch created nasty timing hazards? God, it's been a long time since I
read the docs...
On 6/6/2026 11:25 AM, Chris M. Thomasson wrote:
Or iirc, sometimes in certain use cases, the branch delay slot might not
be executed? Even when programming it directly in ASM and using GAS to
assemble it?