Forum: War Ensemble BBS

Synchronization

From MitchAlsup@[email protected] to comp.arch on Wed May 27 21:44:39 2026

From Newsgroup: comp.arch

This is mainly for Chris M. Thomasson but everyone is free to pile on::

Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties
see that the unit of work was either never started or finished completely without any intermediate state being observed by any party other than the
core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} fall under this definition.

define: memory-atomic as a unit of work identified in a core and packaged
up to be performed by the cache hierarchy or PCIe device control registers. Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- of-work are performed in the cache having write-permission {EXCLUSIVE or MODIFIED cache line states}. When unCacheable, these units of work are performed directly on the addressed control register. In both cases, any interested party sees only the previous state or the after state.

Core-atomic events have the cache hierarchy performing a number (often quadratic) cache line transfers to satisfy a single event ({failing
(n-1) others when under contention}.

Memory-atomic events cannot escape the interconnect latency overhead
but move no data around that does not have to be moved.

Question: If you were starting lout with a clean slate <as it were>,
and not knowing if the atomic-shared-library is to be used in places
where core count was unknown AND interference level was unknown::

Would you implement the atomic-shared-library using core-atomic inst-
ructions or would you implement the library using memory-atomic inst-
ructions ?? Or would you implement some of the library as core-atomic
and other parts as memory-atomic?? and why!

Auxiliary question: Given the definition <above> of memory-atomic::
are not all to-memory packaged events not ATOMIC in the Lamport sense
as long as the package leaves the core sequentially consistent ??

Thanks
Mitch
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu May 28 12:37:14 2026

From Newsgroup: comp.arch

On 5/27/2026 2:44 PM, MitchAlsup wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on::

Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties
see that the unit of work was either never started or finished completely without any intermediate state being observed by any party other than the core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} fall under this definition.

define: memory-atomic as a unit of work identified in a core and packaged
up to be performed by the cache hierarchy or PCIe device control registers. Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- of-work are performed in the cache having write-permission {EXCLUSIVE or MODIFIED cache line states}. When unCacheable, these units of work are performed directly on the addressed control register. In both cases, any interested party sees only the previous state or the after state.

Core-atomic events have the cache hierarchy performing a number (often quadratic) cache line transfers to satisfy a single event ({failing
(n-1) others when under contention}.

Memory-atomic events cannot escape the interconnect latency overhead
but move no data around that does not have to be moved.

Question: If you were starting lout with a clean slate <as it were>,
and not knowing if the atomic-shared-library is to be used in places
where core count was unknown AND interference level was unknown::

Would you implement the atomic-shared-library using core-atomic inst- ructions or would you implement the library using memory-atomic inst- ructions ?? Or would you implement some of the library as core-atomic
and other parts as memory-atomic?? and why!

Auxiliary question: Given the definition <above> of memory-atomic::
are not all to-memory packaged events not ATOMIC in the Lamport sense
as long as the package leaves the core sequentially consistent ??

Both, but the choice is almost "secondary" in a sense as to whether the programmer has done their job first... Also, well, I am quite fond of
relaxed systems. Shit. Perhaps not at relaxed as an alpha, wrt
data-dependent loads.

Before you even pick the instruction class, the programmer must
eliminate false sharing. If the target is an LL/SC architecture, they
really _need_ to know the reservation granule size and pad/align
accordingly. For RMW like LOCK XADD, never straddle a l2 cache line
boundary — pad and align to it. These aren't optional courtesies,
they're prerequisites...?

Actually, an arch that has both LL/SC and atomic RMW, well, that's
pretty nice. I think AMD has that? Give the programmer the ability to
choose.

Beyond that, the deeper answer is: don't hammer a single location at
all! damn it. A single lock-free stack under contention is a problem regardless of whether you're using core-atomic or memory-atomic
instructions. A hash table of lock/wait-free stacks/queues distributes
that pressure. There are single-producer/single consumer (SPSC), multi-producer-single consumer (MPSC), all the way to (MPMC)... Each
algo is optimized for its usage pattern. Design for clever distribution
first, then pick your atomics. Actually if we map the per threads to the
hyper threads, cores, cpu's and numa nodes in a fractal way, then we can
scale fairly well... Gain the topology and localize it in a sense...

For something like a shared counter, I'd go thread-local first,
aggregated per-CPU, then per-NUMA node, etc. Keep sharing as local as
possible — affinity masks in user space to map the topology, and if you
must share outside per-thread variables, share local, as local as
possible...

So for the clean-slate question: memory-atomic where available and
applicable, because it avoids the quadratic cache line transfer problem
you describe under contention. But core-atomic where you need the
flexibility, with the programmer taking responsibility for alignment and distribution. That can be an issue... The programmers MUST know what
they are doing! ;^o

On the auxiliary question: yes, I believe memory-atomic events are
atomic in the Lamport sense?, provided the package leaves the core sequentially consistent — any observer sees only the before or after
state, which is exactly what Lamport requires.

Then there is the SPARC in RMO mode. I have a lot of experience with
that sucker back in the say when Sun gave me a T2000 for the CoolThreads contest and my vzoom project. All the atomic RMW's are relaxed and it
uses its fancy MEMBAR instruction to gain the exact membars that are
needed. For instance, try to avoid #StoreLoad at all costs. Only use
them when needed. Then there is the async, or remote membars ala async lock/wait-free algos. Are you familiar with FlushProcessWriteBuffers on Windows? Afaict, its Microsoft's way to gain exotic algos, like some
variants of RCU?

https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

Actually, before any of that, try to gain a embarrassingly parallel
version of your system and only share when a gun is to your head... ;^)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Sun May 31 20:57:25 2026

From Newsgroup: comp.arch

On 5/28/26 3:37 PM, Chris M. Thomasson wrote:

On 5/27/2026 2:44 PM, MitchAlsup wrote:

[snip atomic op context]

Question: If you were starting lout with a clean slate <as it
were>,
and not knowing if the atomic-shared-library is to be used in
places
where core count was unknown AND interference level was unknown::

Would you implement the atomic-shared-library using
core-atomic inst-
ructions or would you implement the library using
memory-atomic inst-
ructions ?? Or would you implement some of the library as
core-atomic
and other parts as memory-atomic?? and why!

Auxiliary question: Given the definition <above> of
memory-atomic::
are not all to-memory packaged events not ATOMIC in the
Lamport sense
as long as the package leaves the core sequentially consistent ??

Both, but the choice is almost "secondary" in a sense as to
whether the programmer has done their job first... Also, well, I
am quite fond of relaxed systems. Shit. Perhaps not at relaxed
as an alpha, wrt data-dependent loads.

Before you even pick the instruction class, the programmer must
eliminate false sharing. If the target is an LL/SC architecture,
they really _need_ to know the reservation granule size and
pad/align accordingly. For RMW like LOCK XADD, never straddle a
l2 cache line boundary — pad and align to it. These aren't
optional courtesies, they're prerequisites...?

If the operation is performed "centrally" (owning cache, I/O
device), then cache block size may not be important: the cache
can avoid performance problems associated with false sharing
(the singular cache/device acts as a single agent performing
all the operations).

While it would not add significant latency (or significantly
reduce throughput) — I think —, it _seems_ that it would also be
possible for different levels of cache to have different
granularities with respect to exported atomic operations. E.g.,
a cluster of cores sharing L2 cache might use a different
granule for exported atomic operations than L3 (shared by
multiple clusters). A smaller L3 granule would seem to allow for
high contention among many distributed threads with the
assumption that contention within a cluster is necessarily
limited; this might be more a concern when L3 slices are
address-distributed rather than having multiple L3s.

SRAM array width seems a natural granule for optimization as
that is the physical read and write size (typically).

On the other hand, there seem to be other local cache conflict
factors (like bank conflicts) unrelated to "reservation
granule" that such is really probably not a consideration for
software optimization.

One might also have cases where an exported (word granule)
atomic operation is in conflict with a thread-performed (cache
block granule) atomic operation. If the cache block is owned
outside of the thread locally performing an atomic operation,
the exported atomic might be granted priority if the ownership
transfer was still queued, but there might be cache block
ping-pong with thread-performing atomics repeatedly taking
ownership. This implies that consistency with the type of
atomic operation could be important.

For addition, the operations could be coalesced as they pass
toward the owner in the network (and "split" as the result is
returned to individual agents/threads). Such might only make
sense for fire-and-forget operations (like software performance
counters or increasing reference counts) if the centralized
buffering is sufficient. Distributing the operations might
reduce network bandwidth by splitting a message into multiple
messages closer to their return destinations rather than having
multiple messages sent back from the central source, but having
multiple operations on the same data within the queuing delay
seems an uncommon case.

Decreasing reference counts are interesting in that the
interesting case is when the count becomes zero and a resource
freeing operation may be performed. Generally(?) this side
effect would not be synchronous with the thread changing the
reference count, so there could be some sense in allowing such
special cases to be caught outside the thread like a I/O event.
Of course, if resources were being released because they were
scarce and wanted by the releasing thread, then there might be
a latency penalty associated with sending the freeing work to
another hardware context.

Even with ordinary stores that might be read by other threads
(or later in a threads execution under out-of-order execution),
there seems to be some diversity of temporal distance in access
which might be exploited.

There might also be "just-a-peek" memory reads that are used for
general decision making (like reading the consumer pointer of a
buffer to measure roughly how full the buffer is and whether
work priorities should be changed). While it might not be
practical to allow these to have inconsistent ordering (in terms
of software and hardware complexity), there may be optimization
opportunities in exploiting such memory accesses. (Such might
also be useful for value prediction if coherence/ordering
correctness introduces excessive delay, getting a speculative
value earlier may sometimes be useful.)

I remember a paper suggesting assigning limited lifetimes to
cache block ownership to reduce coherence traffic; once the loan
period ended, the block would no longer be held. A "peek" might
be considered an infinitesimal length loan.

Actually, an arch that has both LL/SC and atomic RMW, well,
that's pretty nice. I think AMD has that? Give the programmer
the ability to choose.

Beyond that, the deeper answer is: don't hammer a single
location at all! damn it. A single lock-free stack under
contention is a problem regardless of whether you're using
core-atomic or memory-atomic instructions.

I would guess that there are software development tradeoffs in
how much effort is made to avoid conflicts (and promote
scalability). Sometimes having a global lock may make sense.

A hash table of
lock/wait-free stacks/queues distributes that pressure. There
are single-producer/single consumer (SPSC),
multi-producer-single consumer (MPSC), all the way to (MPMC)...
Each algo is optimized for its usage pattern. Design for clever
distribution first, then pick your atomics. Actually if we map
the per threads to the hyper threads, cores, cpu's and numa
nodes in a fractal way, then we can scale fairly well... Gain
the topology and localize it in a sense...

For something like a shared counter, I'd go thread-local first,
aggregated per-CPU, then per-NUMA node, etc. Keep sharing as
local as possible — affinity masks in user space to map the
topology, and if you must share outside per-thread variables,
share local, as local as possible...

Yet the "topology" changes if the atomic operations are
exported. If all shared counter operations are exported for a
particular address and the hardware uses aggressive
optimizations, then sharing a counter may be sufficiently
scalable with benefits in software simplicity (move complexity
to hardware☺) or other areas.

[snip remaining interesting content]
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Mon Jun 1 15:32:28 2026

From Newsgroup: comp.arch

On 5/31/2026 5:57 PM, Paul Clayton wrote:

On 5/28/26 3:37 PM, Chris M. Thomasson wrote:

On 5/27/2026 2:44 PM, MitchAlsup wrote:

[snip atomic op context]

Question: If you were starting lout with a clean slate <as it were>,
and not knowing if the atomic-shared-library is to be used in places
where core count was unknown AND interference level was unknown::

Would you implement the atomic-shared-library using core-atomic inst-
ructions or would you implement the library using memory-atomic inst-
ructions ?? Or would you implement some of the library as core-atomic
and other parts as memory-atomic?? and why!

Auxiliary question: Given the definition <above> of memory-atomic::
are not all to-memory packaged events not ATOMIC in the Lamport sense
as long as the package leaves the core sequentially consistent ??

Both, but the choice is almost "secondary" in a sense as to whether
the programmer has done their job first... Also, well, I am quite fond
of relaxed systems. Shit. Perhaps not at relaxed as an alpha, wrt
data-dependent loads.

Before you even pick the instruction class, the programmer must
eliminate false sharing. If the target is an LL/SC architecture, they
really _need_ to know the reservation granule size and pad/align
accordingly. For RMW like LOCK XADD, never straddle a l2 cache line
boundary — pad and align to it. These aren't optional courtesies,
they're prerequisites...?

If the operation is performed "centrally" (owning cache, I/O
device), then cache block size may not be important: the cache
can avoid performance problems associated with false sharing
(the singular cache/device acts as a single agent performing
all the operations).

While it would not add significant latency (or significantly
reduce throughput) — I think —, it _seems_ that it would also be
possible for different levels of cache to have different
granularities with respect to exported atomic operations. E.g.,
a cluster of cores sharing L2 cache might use a different
granule for exported atomic operations than L3 (shared by
multiple clusters). A smaller L3 granule would seem to allow for
high contention among many distributed threads with the
assumption that contention within a cluster is necessarily
limited; this might be more a concern when L3 slices are
address-distributed rather than having multiple L3s.

SRAM array width seems a natural granule for optimization as
that is the physical read and write size (typically).

On the other hand, there seem to be other local cache conflict
factors (like bank conflicts) unrelated to "reservation
granule" that such is really probably not a consideration for
software optimization.

One might also have cases where an exported (word granule)
atomic operation is in conflict with a thread-performed (cache
block granule) atomic operation. If the cache block is owned
outside of the thread locally performing an atomic operation,
the exported atomic might be granted priority if the ownership
transfer was still queued, but there might be cache block
ping-pong with thread-performing atomics repeatedly taking
ownership. This implies that consistency with the type of
atomic operation could be important.

For addition, the operations could be coalesced as they pass
toward the owner in the network (and "split" as the result is
returned to individual agents/threads). Such might only make
sense for fire-and-forget operations (like software performance
counters or increasing reference counts) if the centralized
buffering is sufficient. Distributing the operations might
reduce network bandwidth by splitting a message into multiple
messages closer to their return destinations rather than having
multiple messages sent back from the central source, but having
multiple operations on the same data within the queuing delay
seems an uncommon case.

Decreasing reference counts are interesting in that the
interesting case is when the count becomes zero and a resource
freeing operation may be performed. Generally(?) this side
effect would not be synchronous with the thread changing the
reference count, so there could be some sense in allowing such
special cases to be caught outside the thread like a I/O event.
Of course, if resources were being released because they were
scarce and wanted by the releasing thread, then there might be
a latency penalty associated with sending the freeing work to
another hardware context.

Even with ordinary stores that might be read by other threads
(or later in a threads execution under out-of-order execution),
there seems to be some diversity of temporal distance in access
which might be exploited.

There might also be "just-a-peek" memory reads that are used for
general decision making (like reading the consumer pointer of a
buffer to measure roughly how full the buffer is and whether
work priorities should be changed). While it might not be
practical to allow these to have inconsistent ordering (in terms
of software and hardware complexity), there may be optimization
opportunities in exploiting such memory accesses. (Such might
also be useful for value prediction if coherence/ordering
correctness introduces excessive delay, getting a speculative
value earlier may sometimes be useful.)

I remember a paper suggesting assigning limited lifetimes to
cache block ownership to reduce coherence traffic; once the loan
period ended, the block would no longer be held. A "peek" might
be considered an infinitesimal length loan.

Actually, an arch that has both LL/SC and atomic RMW, well, that's
pretty nice. I think AMD has that? Give the programmer the ability to
choose.

Beyond that, the deeper answer is: don't hammer a single location at
all! damn it. A single lock-free stack under contention is a problem
regardless of whether you're using core-atomic or memory-atomic
instructions.

I would guess that there are software development tradeoffs in
how much effort is made to avoid conflicts (and promote
scalability). Sometimes having a global lock may make sense.

A hash table of lock/wait-free stacks/queues distributes that
pressure. There are single-producer/single consumer (SPSC), multi-
producer-single consumer (MPSC), all the way to (MPMC)... Each algo is
optimized for its usage pattern. Design for clever distribution first,
then pick your atomics. Actually if we map the per threads to the
hyper threads, cores, cpu's and numa nodes in a fractal way, then we
can scale fairly well... Gain the topology and localize it in a sense...

For something like a shared counter, I'd go thread-local first,
aggregated per-CPU, then per-NUMA node, etc. Keep sharing as local as
possible — affinity masks in user space to map the topology, and if
you must share outside per-thread variables, share local, as local as
possible...

Yet the "topology" changes if the atomic operations are
exported. If all shared counter operations are exported for a
particular address and the hardware uses aggressive
optimizations, then sharing a counter may be sufficiently
scalable with benefits in software simplicity (move complexity
to hardware☺) or other areas.

[snip remaining interesting content]

The topology needs to be available to the programmer, full stop in a
sense... I get the appeal of "let the hardware handle it" but that's not
good enough? If I can't see the topology, I can't make "informed
decisions" about "data placement", affinity, and locality... Hardware optimizations are a nice bonus on top of good software design. Shit.
Howerver, they're not a substitute for it? And if I can't reason about
the shape of the system, I can't even make an informed decision about
when a global lock is acceptable versus when I need a distributed
structure. That tradeoff requires knowledge, not abstraction.

Also, things do NOT need to be all seq_cst. Shit. My SPARC/RMO
experience is directly relevant here. All the atomic RMWs are relaxed,
and you use MEMBAR to express exactly the ordering you need and nothing
more. Avoid #StoreLoad at all costs — only pay for it when the algorithm genuinely demands it. That discipline forces you to think carefully
about what guarantees you actually need between specific threads, which
is a feature not a bug.

And here's where the two points connect: if you understand the topology,
you know which threads share which data, which tells you what ordering guarantees you actually need between them. Abstract away the topology
and you're pushed toward conservative, expensive, blanket seq_cst
everywhere because you can't reason about anything more precise. Expose
the topology and you can be surgical about it.

Relaxed where you can. Stronger ordering only where you must. And always knowing the shape of the system you're running on. ;^)

Also, for x86 yes we need to pad and align for the l2 cache line anyway.
So shit happens?
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue Jun 2 01:18:36 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 5/31/2026 5:57 PM, Paul Clayton wrote:

On 5/28/26 3:37 PM, Chris M. Thomasson wrote:

On 5/27/2026 2:44 PM, MitchAlsup wrote:

----------------

[snip remaining interesting content]

The topology needs to be available to the programmer, full stop in a sense... I get the appeal of "let the hardware handle it" but that's not good enough? If I can't see the topology, I can't make "informed
decisions" about "data placement", affinity, and locality...

So, what kind of topology is an appropriate description of a system
as small as 1 core, 1 main bus, and 1 PCIe tree or as large as 1M
cores, 1024 independent interconnects, and <essentially> unbounded
number of PCIe trees (such as a basketball stadium sized server) ???

Hardware optimizations are a nice bonus on top of good software design. Shit. Howerver, they're not a substitute for it? And if I can't reason about
the shape of the system, I can't even make an informed decision about
when a global lock is acceptable versus when I need a distributed
structure.

What if there is no concept of a global lock--except one created by
SW's use of the topology in order to bring semblance of order into
which there is essentially none ??

That tradeoff requires knowledge, not abstraction.

Agreed, but do any of us have the vocabulary to adequately describe it ??

Also, things do NOT need to be all seq_cst. Shit. My SPARC/RMO
experience is directly relevant here. All the atomic RMWs are relaxed,
and you use MEMBAR to express exactly the ordering you need and nothing more. Avoid #StoreLoad at all costs — only pay for it when the algorithm genuinely demands it. That discipline forces you to think carefully
about what guarantees you actually need between specific threads, which
is a feature not a bug.

What percent of programmers are capable of absorbing this kind of
knowledge and making effective use of it ??

And here's where the two points connect: if you understand the topology,
you know which threads share which data, which tells you what ordering guarantees you actually need between them.

Say your system is as big as the entire galaxy where it takes 100,000 years
for a signal from one side to reach the other side (and another 100,000
years coming back) yet because of its size you get 10**(10**100) times the compute power you have on earthy scale machines; can you think of an
adequate abstraction that would allow SW to make use of this scale of
a system that is not already embarrassingly parallel ??

Abstract away the topology
and you're pushed toward conservative, expensive, blanket seq_cst
everywhere because you can't reason about anything more precise. Expose
the topology and you can be surgical about it.

Relaxed where you can. Stronger ordering only where you must. And always knowing the shape of the system you're running on. ;^)

Consider the galaxy scale system mentioned above. What kind of topology
will be adequate for that kind of scale ??

Also, for x86 yes we need to pad and align for the l2 cache line anyway.
So shit happens?

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:45:13 2026

From Newsgroup: comp.arch

On 6/1/2026 6:18 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

[...]

Consider the galaxy scale system mentioned above. What kind of topology
will be adequate for that kind of scale ??

Sorry for the quick response, I am really busy right now. Will get back
to you with more detail sometime tonight.

Galaxy level? The super massive black hole. Its smaller black holes
wandering the dust lanes? The stars, the solar systems, the planets in
orbit. Down to the per-thread level...

Say, for starters:

per-thread
per-cpu
per-numa

finally go global only if we have to!

Also, try to avoid sharing when possible.

As for programmers that know this shit, well, they are few and far
between. But, that does not mean, to borrow a phrase from Joe Seigh...

When we get hardware we make software.
When we get lemons we make lemonade.

I love RMO! The Alpha is too strict, but shit happens. Alpha is too
strict in the sense that it does not honor a load dependency. RCU guys
know all about it... shit happens...

Also, for x86 yes we need to pad and align for the l2 cache line anyway.
So shit happens?

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 2 13:47:27 2026

From Newsgroup: comp.arch

On 6/2/2026 1:45 PM, Chris M. Thomasson wrote:

On 6/1/2026 6:18 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

[...]

Consider the galaxy scale system mentioned above. What kind of topology
will be adequate for that kind of scale ??

Sorry for the quick response, I am really busy right now. Will get back
to you with more detail sometime tonight.

Galaxy level? The super massive black hole. Its smaller black holes wandering the dust lanes? The stars, the solar systems, the planets in orbit. Down to the per-thread level...

Say, for starters:

per-thread
per-cpu
per-numa

finally go global only if we have to!

Also, try to avoid sharing when possible.

As for programmers that know this shit, well, they are few and far
between. But, that does not mean, to borrow a phrase from Joe Seigh...

When we get hardware we make software.
When we get lemons we make lemonade.

I love RMO! The Alpha is too strict, but shit happens. Alpha is too
strict in the sense that it does not honor a load dependency. RCU guys
know all about it... shit happens...

Also, for x86 yes we need to pad and align for the l2 cache line anyway. >>> So shit happens?

;^)

per-thread
per-hyperthread/SMT slot
per-core
per-CPU (socket)
per-NUMA node
per-interconnect domain
per-rack
per-datacenter

...per-galaxy
--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Sun Jun 7 15:23:15 2026

From Newsgroup: comp.arch

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on::

Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties
see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.

define: memory-atomic as a unit of work identified in a core and packaged
up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.

I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.

The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.

I feel the best rule for atomic design (for hardware and software) is to
always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.

The problem with memory-atomic/Far-atomic is if the lock is on the same cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around. To do the Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.

There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the platform-specific way to do an atomic transaction (and with all the restrictions), this is high overhead for software, and does NOT match how
most software is written.

But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that cacheline Exclusive and do the atomic operation, a Tentative Transaction starts. You block responding to snoops until we decide we have to. We
can do misses and whatnot, but after say 500 clocks, we allow snoops
again (to avoid deadlocks). The transaction ends when the atomic location
is written again.

What this achieves is an end to thrashing. When it's working, the transaction-prediction hardware lets the thread get the lock, and do its
work, and release the lock, in one go, not responding to outside snoops.
When it does respond to the snoops, the first snooper immediately wins the
lock and gets its work done, and so on. For this to work easily, the
software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.

Add a Transaction-prediction structure, to mark which atomics seem to be
the start of a transaction, and which ones are not, and roughly what that transaction looks like. And then you've solved the cache thrashing of locks--after seeing an atomic a few times, you learn what it's doing, and
can try to make it succeed each time it executes. And how long each transaction seems to be, the raise the limit for blocking snoops to
allow it to be more likely to succeed.

To handle the load peeking at a lock to see if its free requires
the Transaction-prediction hardware to detect this load and effectively
turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
and it becomes the start of the transaction.

This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing. If the software algorithm is poor,
where there's a big lock protecting a lot of work, then it exceeds the hardware's ability to do efficiently, and the hardware instead works like
other platforms, and there can be thrashing.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Sun Jun 7 18:34:11 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) posted:

Thank you Kent, much valuable information below.

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on::

Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.

define: memory-atomic as a unit of work identified in a core and packaged >up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.

I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.

LL/SCs are designed to take no more than LD/ST pipeline time outside of contention.

The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.

What do you do when you have more than 1 cache line--and all lines need
to be in an exclusive state prior to modifying concurrent data ??

I feel the best rule for atomic design (for hardware and software) is to always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.

OK.

The problem with memory-atomic/Far-atomic is if the lock is on the same cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.

The whole reason for {memory,far}-atomic is to avoid moving data around
the interconnect inorder to perform an event/transaction.

To do the Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.

Fair point-but Chris M. Thomasson makes the point that the locks should not
be placed together--your position is that this happens "more or less" all
the time. Hmmmmm

There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the platform-specific way to do an atomic transaction (and with all the restrictions), this is high overhead for software, and does NOT match how most software is written.

But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that cacheline Exclusive and do the atomic operation, a Tentative Transaction starts. You block responding to snoops until we decide we have to. We

In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention
because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become
Exclusive.

can do misses and whatnot, but after say 500 clocks, we allow snoops
again (to avoid deadlocks). The transaction ends when the atomic location
is written again.

In a system with multiple-chips each containing dozens of cores, 500
cycles might not be enough to even send a CI around the interconnect.

What this achieves is an end to thrashing. When it's working, the transaction-prediction hardware lets the thread get the lock, and do its work, and release the lock, in one go, not responding to outside snoops.

My 66000 ships <core> priority around with memory accesses so that the
remote cache can determine if it fails or if requestor fails. Helping
higher priority atomic-events at the cost to lower priority events.

But also; if core reaches the beginning of the stores (to participating
cache lines) while all this data traffic is sorting itself out, then,
core gains the ability to NaK SNOOP requests (priority based) to
participating lines until the event finishes with the SC to the final
variable in memory.

My 66000 ISA contains the necessary instruction semantics to perform core-atomic events over multiple participating cache lines (up to 8).

When it does respond to the snoops, the first snooper immediately wins the lock and gets its work done, and so on. For this to work easily, the software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.

This seems to me to be wrongly optimistic!! Another core attempting a transaction on a concurrent data structure, and the attempt fails
needs to understand that the CDS is no longer in the state this core
read out at the beginning, and instead of trying to complete, it should
take a fresh look at CDS. For example several cores try to deQueue
work from the front of a Queue. One core "wins" and deQueues the front
unit of work. The others need to know the Queue element they were going
after is not there are fail to a point where they take a fresh look at
the Queue.

Add a Transaction-prediction structure, to mark which atomics seem to be
the start of a transaction, and which ones are not, and roughly what that transaction looks like. And then you've solved the cache thrashing of locks--after seeing an atomic a few times, you learn what it's doing, and
can try to make it succeed each time it executes. And how long each transaction seems to be, the raise the limit for blocking snoops to
allow it to be more likely to succeed.

How are you going to weave in that some cores have higher priority than
others? You don't want a low priority queue NaKing snoops which would
allow the higher priority core to make forward progress at the expense
of the lower priority cores. As an example, the low priority event is registering that the left-rear tire is low on air pressure, while the
high priority event is registering panic-braking to avoid a collision.

To handle the load peeking at a lock to see if its free requires
the Transaction-prediction hardware to detect this load and effectively
turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
and it becomes the start of the transaction.

My 66000 can respond to Read-with-Intent with here is data, but it is Shared not Exclusive (based on priority at location of cache line).

This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing. If the software algorithm is poor,
where there's a big lock protecting a lot of work, then it exceeds the hardware's ability to do efficiently, and the hardware instead works like other platforms, and there can be thrashing.

I agree with most of the functionality of the transaction-predictor,
but it is already built into My 66000 ISA--and is not a predictor,
but mechanics.

Kent

THanks again.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sun Jun 7 12:14:24 2026

From Newsgroup: comp.arch

On 6/7/2026 8:23 AM, Kent Dickey wrote:

In article <[email protected]>

Sorry for the big snip...

[...]

Fwiw, from my personal experience, two locks need to be padded and
aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
the reservation granule. Get the granule size wrong on LL/SC and your SC
can cause some interesting harm (a lot of sc failures, etc.) to another
core's reservation even when the locks have no logical relationship to
each other...

So if your algo is using LL/SC you need to worry about the reservation granule.

If its using something akin to LOCK RMW, well, the l2 cache line is your friend.
--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Mon Jun 8 20:21:52 2026

From Newsgroup: comp.arch

In article <1104fui$2ns6j$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/7/2026 8:23 AM, Kent Dickey wrote:

In article <[email protected]>

Sorry for the big snip...

[...]

Fwiw, from my personal experience, two locks need to be padded and
aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
the reservation granule. Get the granule size wrong on LL/SC and your SC
can cause some interesting harm (a lot of sc failures, etc.) to another >core's reservation even when the locks have no logical relationship to
each other...

So if your algo is using LL/SC you need to worry about the reservation >granule.

If its using something akin to LOCK RMW, well, the l2 cache line is your >friend.

Yes, different mutexes/locks/etc. need to be spread apart from each other,
to avoid accidental contention. However, if you have a huge number of mutexes/locks/etc. then it's less important, since by design you're
spreading accesses out over "enough", so you're less likely to get any contention. If you have 300 CPUs and 10,000 database rows with a mutex
per row, then contention is not likely a big deal.

My quick look at C++:

#include <mutex>
std::mutex sample_mutex;

show sample_mutex is 48 bytes. This includes OS/thread overhead to put the thread to sleep if the fast-grab algorithm doesn't work after a few tries, along with some internal alignment padding.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Mon Jun 8 21:36:34 2026

From Newsgroup: comp.arch

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

Thank you Kent, much valuable information below.

[snip]

I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.

LL/SCs are designed to take no more than LD/ST pipeline time outside of >contention.

The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.

What do you do when you have more than 1 cache line--and all lines need
to be in an exclusive state prior to modifying concurrent data ??

Terminology is tricky here. I was using "atomic" to be the individual
atomic instruction, like Compare-and-Swap, or the operation(s) to achieve
a mutex (which might be an atomic instruction and 2-3 others).

Atomics can be used for lock-free algorithms, such as producer/consumer adjusting FIFO ptrs, where there's no spinning, and the atomic operation
is not followed by a "release" of some sort.

So I wanted to talk just about mutexes, where we need an atomic instruction
(or 2-3) to enter a critical area, then we do some amount of work which
is just normal code, then we "release" the mutex.

And what I was trying to get across is it's best if these are done local to
the CPU (and not as Far Atomics), and that hardware can do some fairly
simple things to solve the contention storm issue, using some history
logic to remember how each atomic behaves.

It's my opinion there is no good use case for Far-Atomics/Memory-Atomics,
but that's a bigger issue I don't have time to explain in detail. Well,
if you have non-CPUs who need coordination with CPUs, Far-Atomics/Memory- Atomics might make sense, but that's a special use case.

I feel the best rule for atomic design (for hardware and software) is to
always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.

OK.

The problem with memory-atomic/Far-atomic is if the lock is on the same
cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.

The whole reason for {memory,far}-atomic is to avoid moving data around
the interconnect inorder to perform an event/transaction.

Let's say I have a mutex, protecting a memory-resident logging facility. A thread needs to win the mutex, and then use the logging structure to
write a message, where there's a buffer, a buffer_length, and a
current_offset. In the rare case where the buffer fills, the thread needs
to flush the buffer out, and reset the offset. When done, release the mutex.

#include <mutex>
struct {
std::mutex mutex;
char *buffer_ptr;
int buffer_size;
int buffer_offset;
} Logger;

The mutex contains an internal "lock" variable, where 0 means free, and non-0 means some thread has the mutex. So grabbing "lock" means do a Compare-and-Swap, checking that it is 0, and writing in our thread info if
we win.

So the logging function is:

Logger::log(char *str, int len)
{
int new_len;

mutex.lock(); // Acquire the lock
new_len = MIN(buffer_size - buffer_offset, len);
memcpy(&buffer_ptr[buffer_offset], str, new_len);
buffer_offset += new_len;
if(new_len >= buffer_size) {
str += new_len;
len -= new_len;
// Flush buffer, etc. etc.
(void)write(logfd, buffer_ptr, buffer_size);
buffer_offset = 0;
if(len) {
memcpy(&buffer_ptr[0], str, len);
}
}
mutex.unlock();
}

Grabbing the mutex is the "atomic" operation. Doing the entire log()
function is the Transaction, from mutex.lock() through mutex.unlock(). mutex.unlock() is basically just a store to the lock address to write it
to 0 (but it might involve waking up threads, etc.).

So what I'm proposing is hardware detects that there is a "Transaction"
from mutex.lock() through mutex.unlock(). And it tries to hold off
snoops to the mutex lock address until it completes all the work inside.
What this achieves is the snoop response is then the lock address with
the lock indicating "free", so the next CPU immediately wins the lock
and can do it's log() work. But if it takes "too long" in log(), then
the CPU lets snoops through (which is needed to avoid system deadlocks,
in general), and then it can expose the mutex.lock state while log() is
still running.

And this solves the contention storm. The contention storm happens when
other threads see the mutex.lock variable as locked. In fact, any read
to the mutex.lock variable which shows it locked is wasted effort. So,
if the CPU which wins the mutex can hold off snoops until it unlocks the
mutex, then the contention storm goes away. The mutex just nicely
passes to other CPUs one at a time.

The idea is not to JUST do the atomic operation, but to try to do the
entire log() function at once--but it's NOT required to always succeed
in doing it all at once. If hw can just do it 99% of the time, that
would be a HUGE win. The atomic mutex.lock() and mutex.unlock() must
always be atomic, but if hardware can avoid letting other CPUs see the
hidden lock variable in the locked state most of the time, it fixes the contention storm.

We cannot always succeed in holding off snoops. The CPU which wins the
mutex and then tries to access *buffer_ptr might deadlock the system if
we don't allow snoops to us to proceed under certain traffic conditions,
even in the simplest of cases. In this case, clearly the write() system
call requires snoops to be allowed in.

So we need a timer, and then in the generally rare case when we cannot
just get it all done at once, let snoops happen, and avoid deadlock.
This does cause the start of a contention storm (all CPUs read the lock,
see it's busy, and then try to grab it), but as soon as the next CPU
wins the mutex, the system can go back to the orderly passing to the next
CPU.

To do the
Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.

Fair point-but Chris M. Thomasson makes the point that the locks should not >be placed together--your position is that this happens "more or less" all
the time. Hmmmmm

It's important that different Mutex'es be on different cachelines. But
look at my code example: I have a structure with a mutex and with other
data. If I win the mutex, then the FIRST thing I need is to access that
other data. So it would be great if winning the mutex brought that data
into cache as well.

This is why I think Far-Atomics/Memory-Atomics are not good ideas. If mutex.lock() is done at the memory controller, then the CPU has to bring
in that cache line ANYWAY right afterwords.

There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the
platform-specific way to do an atomic transaction (and with all the
restrictions), this is high overhead for software, and does NOT match how
most software is written.

But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that
cacheline Exclusive and do the atomic operation, a Tentative Transaction
starts. You block responding to snoops until we decide we have to. We

In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention >because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become >Exclusive.

It's easy to detect a mutex. Have a Transaction Prediction structure
indexed by a hash function which includes the address and the
instruction address, and store the address in the entry along with a bit
for CAS, and a bit for Mutex. Mark the entry as having a CAS done to
it, and then when hw sees a store to that some address "soon after",
upgrade the entry to a mutex. Now, when we do a CAS and look it up,
we'll see this is a mutex, and to try to hold off snoops until we see
the store to that address.

If we only ever see CAS to an address, and always from the same instruction address, then it's not a mutex. It would be useful to detect other
Transaction types, but this post is already too long to go into other
cases.

Mutexes are easy for software to use, easy for humans to understand and
reason about, and relatively easy for hardware to make very efficient.

can do misses and whatnot, but after say 500 clocks, we allow snoops
again (to avoid deadlocks). The transaction ends when the atomic location >> is written again.

In a system with multiple-chips each containing dozens of cores, 500
cycles might not be enough to even send a CI around the interconnect.

I would expand the Transaction Prediction structure to have a few bits indicating the time to hold off snoops. Start low, and if we keep seeing
the store not happen in time, and yet it still looks like a mutex, keep increasing the delay. This prevents all atomics from causing long snoop
delay periods, while tuning in enough delay to handle the operations of
real use cases.

What this achieves is an end to thrashing. When it's working, the
transaction-prediction hardware lets the thread get the lock, and do its
work, and release the lock, in one go, not responding to outside snoops.

My 66000 ships <core> priority around with memory accesses so that the
remote cache can determine if it fails or if requestor fails. Helping
higher priority atomic-events at the cost to lower priority events.

But also; if core reaches the beginning of the stores (to participating
cache lines) while all this data traffic is sorting itself out, then,
core gains the ability to NaK SNOOP requests (priority based) to >participating lines until the event finishes with the SC to the final >variable in memory.

My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).

I'm proposing an idea that can be implemented on any CPU, and requires no special support from the interconnect.

I think your NAK idea sounds risky. I've dealt a lot in the past with
buses using retries, and the short answer is it's VERY HARD to implement
a protocol with retries and not have degenerate starvation cases. Everyone thinks they've got it solved, that it's easy, and it's just not easy. There's a reason retries are removed from almost all modern protocols.

The problem is if you have 3 CPUs contending, and the way NAKs work one
CPU might be the unlucky victim forever, with 2 CPUs starving it forever.
In general, it's best to avoid retry-like behavior. For simple cases, you could build in some starvation detection, but in general that just pushes
your starvation/livelock case to a more complex case, not fix it.

When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.

This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
needs to understand that the CDS is no longer in the state this core
read out at the beginning, and instead of trying to complete, it should
take a fresh look at CDS. For example several cores try to deQueue
work from the front of a Queue. One core "wins" and deQueues the front
unit of work. The others need to know the Queue element they were going
after is not there are fail to a point where they take a fresh look at
the Queue.

I tried to expand my thinking with the log() function example so you can see that each CPU will see the updated state left by the previous one.

For your CPU, you propose a Transaction sequence with a start and end instruction, and I'm just pointing out you can get basically all of the
benefit with some prediction hardware picking out mutexes and making them
just as efficient. But with NONE of the deadlock or forward progress
issues, since if things go "wrong", you just default to running the code
as written.

Add a Transaction-prediction structure, to mark which atomics seem to be
the start of a transaction, and which ones are not, and roughly what that
transaction looks like. And then you've solved the cache thrashing of
locks--after seeing an atomic a few times, you learn what it's doing, and
can try to make it succeed each time it executes. And how long each
transaction seems to be, the raise the limit for blocking snoops to
allow it to be more likely to succeed.

How are you going to weave in that some cores have higher priority than >others? You don't want a low priority queue NaKing snoops which would
allow the higher priority core to make forward progress at the expense
of the lower priority cores. As an example, the low priority event is >registering that the left-rear tire is low on air pressure, while the
high priority event is registering panic-braking to avoid a collision.

Priority doesn't come into play at this level. I'm just trying to help
make the log() function appear to execute as a single unit most of the time. And if "most" becomes "never" for some pathological cases, then in the
worst case it works just like existing CPUs.

To handle the load peeking at a lock to see if its free requires
the Transaction-prediction hardware to detect this load and effectively
turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
and it becomes the start of the transaction.

My 66000 can respond to Read-with-Intent with here is data, but it is Shared >not Exclusive (based on priority at location of cache line).

This seems like a not very useful thing to do, why would you do that?
If a CPU says it's going to write to a cacheline, and you give it data but Shared ownership state, it just has to immediately ask for write
permission again.

But this doesn't affect my proposal at all. The CPU has to get write permission to do the Compare-and-Swap for the mutex.lock() operation, and
once it does, then the Transaction functionality kicks in.

This avoids the need for any software changes, while directly speeding up
atomics and what atomics are doing. If the software algorithm is poor,
where there's a big lock protecting a lot of work, then it exceeds the
hardware's ability to do efficiently, and the hardware instead works like
other platforms, and there can be thrashing.

I agree with most of the functionality of the transaction-predictor,
but it is already built into My 66000 ISA--and is not a predictor,
but mechanics.

Kent

THanks again.

And I'm proposing no new architecturally visible changes to make some
lock operations more efficient. So that any CPU could do it.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue Jun 9 00:45:14 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) posted:

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

------------------------

My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).

I'm proposing an idea that can be implemented on any CPU, and requires no special support from the interconnect.

I think your NAK idea sounds risky. I've dealt a lot in the past with
buses using retries, and the short answer is it's VERY HARD to implement
a protocol with retries and not have degenerate starvation cases.

Yes, it is tricky, no it is not impossible. The core must be in a state
that is rather special in order to NaK snoop responses. In fact both
cores need to paly their parts of the tricky game. The snooped core
must only NaK when it is able to get all the way through the STs and
complete the transaction/event; and the snooping core must understand
if it is attempting an AOMTIC event (or not). If it is, the NaK signals
failure of the event, if it is not, the NaK has the core retry its NAKed request.

A NaK retry is on the order of 400-cycles, so the interfering core takes
the added latency, while the core processing the event completes apace.

Everyone thinks they've got it solved, that it's easy, and it's just not easy. There's
a reason retries are removed from almost all modern protocols.

Do you understand how hard it is to perform Double-Compare-Double-Swap
without something special in the protocol under withering contention ??

The problem is if you have 3 CPUs contending, and the way NAKs work one
CPU might be the unlucky victim forever, with 2 CPUs starving it forever.

For the case where all cores are at equal priority, one wants a semblance
of fairness--witness CRAY-XMP as one where cores 7 and 8 could be locked
out of memory references when cores [1..6] were beating on memory at full speed. {Not even ATOMIC}

For the case where cores have priority, one wants fairness based on priority and when 1 core is at lower priority, it takes the hits to allow higher priority cores to make <more> forward progress.

One would not want a low priority Guest OS worker thread to hold up some
high priority kernel thread just because it got to the lock first. This
is commonly known as priority inversion.

In general, it's best to avoid retry-like behavior. For simple cases, you could build in some starvation detection, but in general that just pushes your starvation/livelock case to a more complex case, not fix it.

When a core detects failure it increments a counter, when the counter gets
to a certain point, the core uses a different strategy to make forward progress. Under this <rarely used> strategy, it bundles up the addresses
it needs to complete the event and ships them to a system arbiter. If no address is currently reserved, the system arbiter grants exclusive access
to all <up to> 8 lines simultaneously. At this point core can Nak any
snoop on those addresses. When the event is complete, core sends the same addresses to system arbiter; who removes their reservations.

This can transform a BigO( n^3 ) problem under heavy contention into a
BigO( 3 ) problem {when SW knows how to use the data HW provides}.

When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.

This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
needs to understand that the CDS is no longer in the state this core
read out at the beginning, and instead of trying to complete, it should >take a fresh look at CDS. For example several cores try to deQueue
work from the front of a Queue. One core "wins" and deQueues the front
unit of work. The others need to know the Queue element they were going >after is not there are fail to a point where they take a fresh look at
the Queue.

I tried to expand my thinking with the log() function example so you can see that each CPU will see the updated state left by the previous one.

For your CPU, you propose a Transaction sequence with a start and end instruction, and I'm just pointing out you can get basically all of the benefit with some prediction hardware picking out mutexes and making them just as efficient.

Yes, I saw that.

But with NONE of the deadlock or forward progress
issues, since if things go "wrong", you just default to running the code
as written.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:20:53 2026

From Newsgroup: comp.arch

On 6/7/2026 11:34 AM, MitchAlsup wrote:

[email protected] (Kent Dickey) posted:

Thank you Kent, much valuable information below.

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on::

Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >>> fall under this definition.

DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK CMPXCHG8B on a 32 bit system.

define: memory-atomic as a unit of work identified in a core and packaged >>> up to be performed by the cache hierarchy or PCIe device control registers. >>> Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP,
CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >>> of-work are performed in the cache having write-permission {EXCLUSIVE or >>> MODIFIED cache line states}. When unCacheable, these units of work are
performed directly on the addressed control register. In both cases, any >>> interested party sees only the previous state or the after state.

I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.

LL/SCs are designed to take no more than LD/ST pipeline time outside of contention.

The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.

What do you do when you have more than 1 cache line--and all lines need
to be in an exclusive state prior to modifying concurrent data ??

I feel the best rule for atomic design (for hardware and software) is to
always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.

OK.

The problem with memory-atomic/Far-atomic is if the lock is on the same
cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.

The whole reason for {memory,far}-atomic is to avoid moving data around
the interconnect inorder to perform an event/transaction.

To do the
Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.

Fair point-but Chris M. Thomasson makes the point that the locks should not be placed together--your position is that this happens "more or less" all
the time. Hmmmmm

Unfortunately its true. The mutex word, or state if you will, should be
padded and aligned up to a l2 cache line for atomic RMW or a reservation granule for LL/SC.

There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the
platform-specific way to do an atomic transaction (and with all the
restrictions), this is high overhead for software, and does NOT match how
most software is written.

But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that
cacheline Exclusive and do the atomic operation, a Tentative Transaction
starts. You block responding to snoops until we decide we have to. We

In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become
Exclusive.

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state machines and such.

[...]

The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:30:58 2026

From Newsgroup: comp.arch

On 6/7/2026 8:23 AM, Kent Dickey wrote:
[...]

This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing.

What about weak vs strong CAS? Also how does emulating a LOCK XADD speed things up? Now SC can fail, and we have to loop. How many times do we
have to loop to know what that emulated LOCK XADD finishes? Its akin to emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
former is quite grand.

Breaking things apart works just as well for LL/SC as atomic RMW.

How does the LL/SC work wrt the arch? How fine grain is it? Is its
reservation granule the same size as say a L2 cache line for atomic RMW?

On systems with both LL/SC and atomic RMW, is it one size fits all?

LL/SC can suffer from live lock and false sharing.

LOCK XADD can suffer from false sharing, its not going to live lock on ya.

[...]
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:35:01 2026

From Newsgroup: comp.arch

On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]

Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??

It sounds like you are trying to fulfill a lot of the academic papers?

However, DWCAS is a really important one. This is only a double width
compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are non-contiguous, then things get into LL/SC like worlds...

[...]
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 9 14:37:54 2026

From Newsgroup: comp.arch

On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]

Have you ever read about KCSS? K compare single swap? An older paper.
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:25:02 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

[email protected] (Kent Dickey) posted:

Thank you Kent, much valuable information below.

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
fall under this definition.

DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK CMPXCHG8B on a 32 bit system.

I wrote DCADS to represent the two addresses are completely unrelated
{like *p and *q are unrelated unless you know something special.
You can update 2 pointers in a linked list.
------------------------

In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become Exclusive.

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded!

This is what allows the ABA problem to raise its ugly head !! There
could be a week (7 days) between the LD and the CAS--there is no
expectation that the data (modified hundreds of times per second
over that week returns to the exact state at when the LD was performed.)

strong
does not allow that to occur. So, it can be used for interesting state machines and such.

[...]

The main point is that the software needs to pad and align accordingly.

I have no argument with that. My argument is that one needs FEWER ATOMIC
events when more memory locations can be manipulated in a single event.
Fewer Events lower the exponent of interference and improves performance.

Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:28:13 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]

Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??

It sounds like you are trying to fulfill a lot of the academic papers?

However, DWCAS is a really important one. This is only a double width compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are non-contiguous, then things get into LL/SC like worlds...

DDCAS works better as both comparands are 64-bit and back-to-back.

Don't see how you can much to work with two 32-bit comparands.

[...]

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Wed Jun 10 01:33:30 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

-------------------------

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state machines and such.

[...]

The main point is that the software needs to pad and align accordingly. Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.

If you want the strong CAS it is written::

LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

If you want weak CAS it is written:

LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

It just changes the length of time interference is being observed.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:32:47 2026

From Newsgroup: comp.arch

On 6/9/2026 6:28 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]

Do you understand how hard it is to perform Double-Compare-Double-Swap
without something special in the protocol under withering contention ??

It sounds like you are trying to fulfill a lot of the academic papers?

However, DWCAS is a really important one. This is only a double width
compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are
non-contiguous, then things get into LL/SC like worlds...

DDCAS works better as both comparands are 64-bit and back-to-back.

Don't see how you can much to work with two 32-bit comparands.

[...]

Think of DWCAS on a 32-bit system. LOCK CMPXCHG8B on a 32 bit system. We
can do many interesting things with it. LOCK CMPXCHG16B on a 64-bit
system, ditto.

DWCAS only operates on two contiguous words. Well, that alone allows us
to create a lot of lock/wait free algos. The academic papers were
interested in DCAS, KCSS, ect... They never had to make the cpu that can
do that efficiently.

FWIW, DWCAS can do other things besides ABA. Think of using two 32 bit
regions of memory and a DWCAS can store "offsets" into both of them.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:50:18 2026

From Newsgroup: comp.arch

On 6/9/2026 6:25 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

[email protected] (Kent Dickey) posted:

Thank you Kent, much valuable information below.

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>>>
Given a normal chip with multiple cores and a typical cache hierarchy >>>>> with one-or-more PCIe root complexes::

define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>>>> see that the unit of work was either never started or finished completely >>>>> without any intermediate state being observed by any party other than the >>>>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
fall under this definition.

DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK
CMPXCHG8B on a 32 bit system.

I wrote DCADS to represent the two addresses are completely unrelated
{like *p and *q are unrelated unless you know something special.
You can update 2 pointers in a linked list.
------------------------

In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention
because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become
Exclusive.

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded!

This is what allows the ABA problem to raise its ugly head !! There
could be a week (7 days) between the LD and the CAS--there is no
expectation that the data (modified hundreds of times per second
over that week returns to the exact state at when the LD was performed.)

ABA is an issue, wrt well you know the appendix of a IBM paper. Iirc,
Annn and Lynn, sorry for butchering their names, were around when CAS
was invented, iirc Lynn said it was their initials?

However, I tend to work around that. For instance think of a distributed amortized lock-free stack that completely avoid DWCAS and ABA. So, for
insert we can use CAS. BUT, for the pop, we flush all with an atomic SWAP.

strong
does not allow that to occur. So, it can be used for interesting state
machines and such.

[...]

The main point is that the software needs to pad and align accordingly.

I have no argument with that. My argument is that one needs FEWER ATOMIC events when more memory locations can be manipulated in a single event.
Fewer Events lower the exponent of interference and improves performance.

Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an
atomic RMW using L2 cache lines.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 12:53:07 2026

From Newsgroup: comp.arch

On 6/9/2026 6:33 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

-------------------------

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state
machines and such.

[...]

The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an
atomic RMW using L2 cache lines.

If you want the strong CAS it is written::

LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

If you want weak CAS it is written:

LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

It just changes the length of time interference is being observed.

Pretty interesting! So, we can have spurious failure on the weak CAS and
none on the strong CAS?

strong CAS failure can be used in a state machine, weak CAS failure
cannot...
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 13:16:00 2026

From Newsgroup: comp.arch

On 6/8/2026 1:21 PM, Kent Dickey wrote:

In article <1104fui$2ns6j$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/7/2026 8:23 AM, Kent Dickey wrote:

In article <[email protected]>

Sorry for the big snip...

[...]

Fwiw, from my personal experience, two locks need to be padded and
aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
the reservation granule. Get the granule size wrong on LL/SC and your SC
can cause some interesting harm (a lot of sc failures, etc.) to another
core's reservation even when the locks have no logical relationship to
each other...

So if your algo is using LL/SC you need to worry about the reservation
granule.

If its using something akin to LOCK RMW, well, the l2 cache line is your
friend.

Yes, different mutexes/locks/etc. need to be spread apart from each other,
to avoid accidental contention. However, if you have a huge number of mutexes/locks/etc. then it's less important, since by design you're
spreading accesses out over "enough", so you're less likely to get any contention. If you have 300 CPUs and 10,000 database rows with a mutex
per row, then contention is not likely a big deal.

My quick look at C++:

#include <mutex>
std::mutex sample_mutex;

show sample_mutex is 48 bytes. This includes OS/thread overhead to put the thread to sleep if the fast-grab algorithm doesn't work after a few tries, along with some internal alignment padding.

[...]

Ahhh, not a C++ mutex but words in memory wrt lock-free algos using
atomic RMW. They are different... Fwiw, check this out. A way to
"emulate" a C++ std::atomic using a table of mutexes:

read all when you get some time:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

Now, this is different because its NOT lock-free. Using say LOCK XADD is lock-free, BUT. We need to give it a break via amortization via distribution...

We don't want a shit load of threads hammering a word that is used for
LOCK XADD or even a LL/SC emulating a XADD.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Wed Jun 10 13:22:42 2026

From Newsgroup: comp.arch

On 6/10/2026 12:32 PM, Chris M. Thomasson wrote:

On 6/9/2026 6:28 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]

Do you understand how hard it is to perform Double-Compare-Double-Swap >>>> without something special in the protocol under withering contention ?? >>>

It sounds like you are trying to fulfill a lot of the academic papers?

However, DWCAS is a really important one. This is only a double width
compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are
non-contiguous, then things get into LL/SC like worlds...

DDCAS works better as both comparands are 64-bit and back-to-back.

Don't see how you can much to work with two 32-bit comparands.

[...]

Think of DWCAS on a 32-bit system. LOCK CMPXCHG8B on a 32 bit system. We
can do many interesting things with it. LOCK CMPXCHG16B on a 64-bit
system, ditto.

DWCAS only operates on two contiguous words. Well, that alone allows us
to create a lot of lock/wait free algos. The academic papers were
interested in DCAS, KCSS, ect... They never had to make the cpu that can
do that efficiently.

FWIW, DWCAS can do other things besides ABA. Think of using two 32 bit regions of memory and a DWCAS can store "offsets" into both of them.

Wrt CAS in general, this is a fun thread:

https://groups.google.com/g/comp.arch/c/1zNPUHo8YoI/m/jtq3VdG7BQAJ
--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Thu Jun 11 02:58:23 2026

From Newsgroup: comp.arch

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/7/2026 8:23 AM, Kent Dickey wrote:
[...]

This avoids the need for any software changes, while directly speeding up
atomics and what atomics are doing.

What about weak vs strong CAS? Also how does emulating a LOCK XADD speed >things up? Now SC can fail, and we have to loop. How many times do we
have to loop to know what that emulated LOCK XADD finishes? Its akin to >emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
former is quite grand.

Breaking things apart works just as well for LL/SC as atomic RMW.

How does the LL/SC work wrt the arch? How fine grain is it? Is its >reservation granule the same size as say a L2 cache line for atomic RMW?

On systems with both LL/SC and atomic RMW, is it one size fits all?

LL/SC can suffer from live lock and false sharing.

LOCK XADD can suffer from false sharing, its not going to live lock on ya.

[...]

You trimmed everything, so there's really nothing I can respond to. You
have your view as to how atomics should work, and it's an interesting perspective, but not the only one.

If you had access to a system where std::mutex was faster and handled contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
than the best implementation on any system), would you use mutexes rather
than lock-free algorithms?

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Thu Jun 11 11:53:23 2026

From Newsgroup: comp.arch

On 6/10/2026 7:58 PM, Kent Dickey wrote:

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/7/2026 8:23 AM, Kent Dickey wrote:
[...]

This avoids the need for any software changes, while directly speeding up >>> atomics and what atomics are doing.

What about weak vs strong CAS? Also how does emulating a LOCK XADD speed
things up? Now SC can fail, and we have to loop. How many times do we
have to loop to know what that emulated LOCK XADD finishes? Its akin to
emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
former is quite grand.

Breaking things apart works just as well for LL/SC as atomic RMW.

How does the LL/SC work wrt the arch? How fine grain is it? Is its
reservation granule the same size as say a L2 cache line for atomic RMW?

On systems with both LL/SC and atomic RMW, is it one size fits all?

LL/SC can suffer from live lock and false sharing.

LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>
[...]

You trimmed everything, so there's really nothing I can respond to. You
have your view as to how atomics should work, and it's an interesting perspective, but not the only one.

If you had access to a system where std::mutex was faster and handled contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
than the best implementation on any system), would you use mutexes rather than lock-free algorithms?

Hummm...

std::mutex is not lock-free by default.

And more importantly, a LOCK XADD is fundamentally different from using
a std::mutex to emulate one. They are not two implementations of the
same thing.

LOCK XADD is a single atomic instruction, one hardware transaction, no
kernel involvement, no thread blocking, no scheduler interaction. Done.

A std::mutex wrapping an increment is a completely different beast. Even
in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse —
threads block, context switches happen, the OS scheduler gets involved.

So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation — it's what guarantees and costs come with the primitive. LOCK XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties regardless of how fast it is.

There are also contexts where you literally cannot use a mutex — signal handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice.
--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Thu Jun 11 21:30:18 2026

From Newsgroup: comp.arch

In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/10/2026 7:58 PM, Kent Dickey wrote:

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

[snip]

On systems with both LL/SC and atomic RMW, is it one size fits all?

LL/SC can suffer from live lock and false sharing.

LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>>
[...]

You trimmed everything, so there's really nothing I can respond to. You
have your view as to how atomics should work, and it's an interesting
perspective, but not the only one.

If you had access to a system where std::mutex was faster and handled
contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
than the best implementation on any system), would you use mutexes rather
than lock-free algorithms?

Hummm...

std::mutex is not lock-free by default.

And more importantly, a LOCK XADD is fundamentally different from using
a std::mutex to emulate one. They are not two implementations of the
same thing.

Lock-free algorithms solve two problems of mutexes: they are immune to a
thread being reschedules while holding a lock; and they are a response
to poor performance of locks on many systems. Remote-memory atomics
(doing the atomic in the fabric or at the memory controller) are a
response to the poor contention performance of locks or atomics (on lock-free or locks) being done by the CPU on many platforms.

But if we fix those two problems, and make locks fast, then there's much
less need for lock-free algorithms.

Lock-free also solves the thread preemption issue--a thread that wins a mutex and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.

LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.

A std::mutex wrapping an increment is a completely different beast. Even
in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse — >threads block, context switches happen, the OS scheduler gets involved.

So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation — it's what guarantees and costs come with the primitive. LOCK >XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.

You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences. I didn't describe this in that post, but if you add in a CPU
which supports sequential ordering with the mutex Transaction Prediction
logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.

There are also contexts where you literally cannot use a mutex — signal >handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice.

If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions. A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW implementation. Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.

What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.
Lock-free algorithms make it much harder for hardware to see what is the
real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new
item's next-> to point to that, then doing Compare-and-Swap on the head
pointer with our item pointer. It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the
CAS to the head pointer, and can try to make that whole sequence atomic,
but this is just a trivial lock-free case).

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 02:16:32 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) posted:

ERROR "unexpected byte sequence starting at index 2542: '\xE2'" while decoding:

In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/10/2026 7:58 PM, Kent Dickey wrote:

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

[snip]

On systems with both LL/SC and atomic RMW, is it one size fits all?

LL/SC can suffer from live lock and false sharing.

LOCK XADD can suffer from false sharing, its not going to live lock on ya.

[...]

You trimmed everything, so there's really nothing I can respond to. You >> have your view as to how atomics should work, and it's an interesting
perspective, but not the only one.

If you had access to a system where std::mutex was faster and handled
contention better than DCAS, CAS, and LOCK XADD in all cases (and faster >> than the best implementation on any system), would you use mutexes rather >> than lock-free algorithms?

Hummm...

std::mutex is not lock-free by default.

And more importantly, a LOCK XADD is fundamentally different from using
a std::mutex to emulate one. They are not two implementations of the
same thing.

Lock-free algorithms solve two problems of mutexes: they are immune to a thread being reschedules while holding a lock; and they are a response
to poor performance of locks on many systems. Remote-memory atomics
(doing the atomic in the fabric or at the memory controller) are a
response to the poor contention performance of locks or atomics (on lock-free or locks) being done by the CPU on many platforms.

But if we fix those two problems, and make locks fast, then there's much
less need for lock-free algorithms.

Lock-free also solves the thread preemption issue--a thread that wins a mutex and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.

The LL/SC sequence on My 66000 implementations is guaranteed to fail
when an interrupt (or exception) causes a transfer of control out of
the thread, not while running the critical section, but while acquiring
or releasing the mutex.

LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.

A std::mutex wrapping an increment is a completely different beast. Even >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before >you've even touched the data. Under contention it gets much worse â >threads block, context switches happen, the OS scheduler gets involved.

So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation â it's what guarantees and costs come with the primitive. LOCK
XADD gives you atomicity with no blocking, no kernel, no priority >inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.

You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.

Do you agree that an ISA cannot make an atomic-acquire or atomic-release
any faster than a core can perform the same instruction sequence without
any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

I didn't describe this in that post, but if you add in a CPU which supports sequential ordering with the mutex Transaction Prediction logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.

The LL/SC semantics on My 66000 directly supports using as many as
8 cache lines in a single ATOMIC event with the property that all
3rd party observers only see the before and after states of all
cache lines. From the beginning of the first LL to the end with an
SC, all such participating memory modifications appear atomic.

Thus, there was no need for the concept of nesting nor serialization.
One does not have to get 1 mutex and then get another--you can get both
at the same time (should you need that).

This brings up the notion of the difference between an ATOMIC-event
and a critical section. An ATOMIC-event is a unit of work that all
interested 3rd parties can all agree on "order" {Joe got it Sally
did not}. All of this can be handled by addresses, without consideration
of any data value (address values yes, data values, no). Acquire-mutex
and release-mutex are ATOMIC based on address. While the rest of the
critical section is primarily based on data (and pointers used as data).

There are also contexts where you literally cannot use a mutex â signal
handlers, certain real-time contexts, interrupt service routines. >Lock-free is not just a performance choice, it's sometimes the only choice.

If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.

It also requires a control transfer point so that if an interrupt is
not able to be deferred, that the mutex guarded critical section has
a way to KNOW that the un-interruptible-event was, indeed, interrupted.
My 66000 has this built into its pipelined version of LL/SC.

A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW implementation.

I like to say, My 66000 ATOMICs are not a form of TM but can be used to
make a SW version of TM have the appropriate semantics guarantees.

Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.

What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.

From the HW point of view (and the programming model of My 66000),
There are 3 noted events:
a) acquisition of the mutex(s)
b) transit of the critical section
c) release of the mutex(s).

The problem with old style ATOMICs is that while (a) is easily visible
as ATOMIC, (c) is not (although a predictor would likely work quite
well.

The problem with CAS style ATOMICs is that the initial LD(s) are not
seen as ATOMIC until the CAS is performed, opening a window for ABA.

The problem with LL/SC style ATOMICs is that while the whole sequence
is known to be a critical section, spurious failures are allowed in
most implementations.

It is possible with My 66000 ATOMICs to perform the whole critical section WITHIN a single ATOMIC-event.

Lock-free algorithms make it much harder for hardware to see what is the
real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new item's next-> to point to that, then doing Compare-and-Swap on the head pointer with our item pointer.

Try that on a doubly linked list. That is where DCADS comes in.

It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the CAS to the head pointer, and can try to make that whole sequence atomic,
but this is just a trivial lock-free case).

Why thinking about this this afternoon, I am left wondering if the page
tables could be augmented with a bit indicating that ATOMIC stuff is going
on on this page ??

Kent

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 02:20:34 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/9/2026 6:33 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

-------------------------

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state
machines and such.

[...]

The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an >> atomic RMW using L2 cache lines.

If you want the strong CAS it is written::

LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

If you want weak CAS it is written:

LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

It just changes the length of time interference is being observed.

Pretty interesting! So, we can have spurious failure on the weak CAS and none on the strong CAS?

It is not zero, but it is exponentially fewer; and no architecture is
going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

strong CAS failure can be used in a state machine, weak CAS failure cannot...

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Fri Jun 12 02:43:22 2026

From Newsgroup: comp.arch

On 6/11/2026 2:30 PM, Kent Dickey wrote:

Will get back to you but...

What's nice about a mutex, which is basically a user-land spinlock

NO! a mutex is not a spinlock. A mutex can allow a thread to wait in the kernel. A spinlock is way different. Now there are adpative mutex that
can spin a bit before they hit the slow path, kernel wait.

Fwiw, I have fun way to use mutexes using try_lock, but that is another
story.

Also, relaxed arch with competent programmers can make things fly. Btw,
are you familiar with RCU?

_____

This is using the idea of trying to do other work while a mutex is
contended. Why spin or wait when we don't really have to? The per-thread random numbers in this experiment are only there to try to simulate when
a contended thread has any other work to do. This can be check
per-thread stacks, queues, ect... Here is my example code. Can you get
it to run, and can you show me the output? Now, this can work in real
world environments. It's a stare to an adaptive mutex that just spins
doing nothing... lol... ;^)

I compiled it using c++20

(read all...)
____________________________________
// A Fun Mutex Pattern? Or, a Nightmare? Humm...
// By: Chris M. Thomasson
//___________________________________________________

#include <iostream>
#include <random>
#include <numeric>
#include <algorithm>
#include <thread>
#include <atomic>
#include <mutex>

#define CT_WORKERS (42)
#define CT_ITERS (996699)
#define CT_BACKOFFS (42)
#define CT_RAND_MAX (20)
#define CT_RAND_THRESHOLD (5)

struct ct_shared
{
std::mutex m_fun_mutex;
std::atomic<unsigned long> m_other_work = { 0 };
int m_test_count0 = 0;

void
sanity_check_dump() const
{
std::cout << "(ct_shared:" << this << ")->" <<
"m_test_count0 = " << m_test_count0 << ", " <<
"m_other_work = " << m_other_work.load(std::memory_order_relaxed) << "\n";
}

bool
sanity_check_validate() const
{
return (m_test_count0 == CT_ITERS * CT_WORKERS);
}
};

void
ct_worker_entry(
ct_shared& shared
) {
//std::cout << "ct_worker_entry" << std::endl; // testing thread
race for sure...

// Thread Local...
std::random_device rnd_seed = { };
std::mt19937 rnd_gen(rnd_seed());
std::uniform_int_distribution<unsigned long> rnd_dist(0, CT_RAND_MAX);

for (unsigned long i = 0; i < CT_ITERS; ++i)
{
// Lock logic...
{
unsigned long backoff = 0;

while (! shared.m_fun_mutex.try_lock())
{
unsigned long rnd0 = rnd_dist(rnd_gen);

if (rnd0 > CT_RAND_THRESHOLD || backoff > CT_BACKOFFS)
{
shared.m_fun_mutex.lock();
break;
}

// do other work... :^)
shared.m_other_work.fetch_add(1,
std::memory_order_relaxed);

// but not too much work... ;^o
++backoff;
}
}

// Critical Section...
{
shared.m_test_count0 = shared.m_test_count0 + 1;
}

// Unlock
{
shared.m_fun_mutex.unlock();
}
}
}

int main()
{
// Hello... :^)
{
std::cout << "Hello ct_fun_mutex... lol? ;^) ver:(0.0.0)\n";
std::cout << "By: Chris M. Thomasson\n";
std::cout << "____________________________________________________\n";
std::cout.flush();
}

// Create our fun things... ;^)
ct_shared shared = { };
std::thread workers[CT_WORKERS] = { };

// Lanuch...
{
std::cout << "Launching Threads...\n";
std::cout.flush();

for (unsigned long i = 0; i < CT_WORKERS; ++i)
{
workers[i] = std::thread(ct_worker_entry, std::ref(shared));
}
}

// Join...
{
std::cout << "Joining Threads... (computing :^)\n";
std::cout.flush();
for (unsigned long i = 0; i < CT_WORKERS; ++i)
{
workers[i].join();
}
}

// Sanity Check...
{
shared.sanity_check_dump();

if (! shared.sanity_check_validate())
{
std::cout << "\n\n**** Pardon my French, but FUCK!!!!!
****\n" << std::endl;
}

else
{
std::cout << "\nWe are Sane!\n\n";
std::cout << "We completed " <<
shared.m_other_work.load(std::memory_order_relaxed) <<
" work items while waiting for the mutex..." << std::endl;
}
}

// Fin...
{
std::cout << "____________________________________________________\n";
std::cout << "Fin... :^)\n" << std::endl;
}

return 0;
}
____________________________________

Any luck? Its fun to see how many work items were completed when the
mutex was contended...

--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Fri Jun 12 15:15:20 2026

From Newsgroup: comp.arch

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/10/2026 7:58 PM, Kent Dickey wrote:

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

[snip]

Lock-free also solves the thread preemption issue--a thread that wins a mutex
and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave >> some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.

The LL/SC sequence on My 66000 implementations is guaranteed to fail
when an interrupt (or exception) causes a transfer of control out of
the thread, not while running the critical section, but while acquiring
or releasing the mutex.

LOCK XADD is a single atomic instruction, one hardware transaction, no
kernel involvement, no thread blocking, no scheduler interaction. Done.

A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse â >> >threads block, context switches happen, the OS scheduler gets involved.

So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation it's what guarantees and costs come with the

primitive. LOCK

XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties
regardless of how fast it is.

You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.

Do you agree that an ISA cannot make an atomic-acquire or atomic-release
any faster than a core can perform the same instruction sequence without
any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

The obvious answer is no, and that is true for an uncontended atomic instruction. However, once contention comes in to play, anything that
reduces the contention storm can be a big win. It's not clear how to
reduce that storm when it's caused by ordinary LD/ST, but I see lots of
ways to reduce it much of the time if HW can detect the operation's
boundaries.

I didn't describe this in that post, but if you add in a CPU
which supports sequential ordering with the mutex Transaction Prediction
logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.

The LL/SC semantics on My 66000 directly supports using as many as
8 cache lines in a single ATOMIC event with the property that all
3rd party observers only see the before and after states of all
cache lines. From the beginning of the first LL to the end with an
SC, all such participating memory modifications appear atomic.

Thus, there was no need for the concept of nesting nor serialization.
One does not have to get 1 mutex and then get another--you can get both
at the same time (should you need that).

This brings up the notion of the difference between an ATOMIC-event
and a critical section. An ATOMIC-event is a unit of work that all
interested 3rd parties can all agree on "order" {Joe got it Sally
did not}. All of this can be handled by addresses, without consideration
of any data value (address values yes, data values, no). Acquire-mutex
and release-mutex are ATOMIC based on address. While the rest of the
critical section is primarily based on data (and pointers used as data).

Transactional Memory from 15 years ago was going to fundamentally change
the way multi-threaded apps worked. And then it fizzled out. Because
the goal was laudable: remove locks and atomic altogether, and make
adding a record to a database be just:

START_TXN();
add_record_to_database();
END_TXN();

and just do add_record_to_database() effectively atomically. The problem
is that edge cases mean sometimes add_record_to_database() can be 5-10 instructions, and sometimes it's thousands. And Transactional Memory
is very complex for hardware to get right, and it's hard for software to
get right too (since software cannot let there be thousands of instructions inside the TXN, so it needs to be rewritten).

Basically, there are too many unsolved complex problems. Some of which
apply to your ESM--needing software changes is a big problem. And so far
no one has shown how to exhaustively validate this kind of mechanism.
Because there's no real fallback: it has to work, it's not a feature that
can just be "turned off".

There are also contexts where you literally cannot use a mutex â signal
handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice. >>

If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.

It also requires a control transfer point so that if an interrupt is
not able to be deferred, that the mutex guarded critical section has
a way to KNOW that the un-interruptible-event was, indeed, interrupted.
My 66000 has this built into its pipelined version of LL/SC.

A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW
implementation.

I like to say, My 66000 ATOMICs are not a form of TM but can be used to
make a SW version of TM have the appropriate semantics guarantees.

I've had more time to think about my vague description on how to make an non-interruptible mutex, and now I think it's not a good idea. I think
trying to hold off actual interrupts is still a good idea, but not to have
HW try to back out of the mutex operation itself. It's just too much like Transactional Memory--rather than being an optional acceleration which can
be turned off a problem is found, non-interruptible mutex must always be correct and must provide guarantees, and so it's wandered into the minefield
of Transactional Memory, and it's best to stay out of there.

Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free
algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.

What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.

From the HW point of view (and the programming model of My 66000),
There are 3 noted events:
a) acquisition of the mutex(s)
b) transit of the critical section
c) release of the mutex(s).

The problem with old style ATOMICs is that while (a) is easily visible
as ATOMIC, (c) is not (although a predictor would likely work quite
well.

Yes, I agree. But note the release of the mutex is to the same address
as the atomic instruction, so it can be detected after it's been seen.

The problem with CAS style ATOMICs is that the initial LD(s) are not
seen as ATOMIC until the CAS is performed, opening a window for ABA.

Agreed, and it's hard to pick out the preparation work being done for
these lock-free algorithms.

The problem with LL/SC style ATOMICs is that while the whole sequence
is known to be a critical section, spurious failures are allowed in
most implementations.

I am not a fan of LL/SC. It's possible to get it right in HW, but most designers underestimate the invasiveness of LL/SC--the entire load/store
unit needs little modifications everywhere to make LL/SC work efficiently
and reliably. You want to implement transient memory or writes which
bypass the cache, or you speculatively execute instructions past the
end of the LL/SC loop--make sure you solve how they interact with LL/SC.
I've yet to meet any design which got it "right" the first try (but once
a design team learns, then they get it right, until another new system
feature pops up and breaks it).

LL/SC is the academic solution to building "let's allow anything" atomics.
"We can let the user build the atomic operations are their dreams, and
we don't have to solidify it in HW, we just let them write code."
It's like a preliminary Transactional Memory--we can let the users do a
lot with us just needing to do a little bit in hardware.

And then when LL/SC gets implemented, you basically end up with
Fetch-and-Add, Swap, and Compare-and-Swap (and other trivial ops, like Fetch-and-AND, etc.) only. The RISC-V restrictions are pretty intense:
You can do LR; you can then do a handful of integer instructions (no FP,
no vector) not including any ld/st or system instructions, nor any
backward branches, no function calls; and then SC, then you get the same restrictions but now you get a backwards branch. What exactly are you
able to implement with these restrictions other than the usual atomic operations that now most everyone puts in HW? Note on RISC-V, if an implementation has LR/SC, then it must ALSO have the atomic instructions Fetch-and-Add, Swap, Compare-and-Swap, and the other simple ones. So why bother with LR/SC?

I understand where RISC-V's LL/SC restrictions are coming from. I get it. Which is why I don't like LL/SC.

It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.

Lock-free algorithms make it much harder for hardware to see what is the
real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new
item's next-> to point to that, then doing Compare-and-Swap on the head
pointer with our item pointer.

Try that on a doubly linked list. That is where DCADS comes in.

When I say Compare-and-Swap, I implicitly include DCAS. DCADS is more
exotic, so if I mean that, I'd explicitly list it.

It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic,
but this is just a trivial lock-free case).

Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??

I'm just not sure that provides much help.

To me, the key insight which I wanted to get across was that with a mutex
(and spinlock), it's a huge advantage to NOT let other CPUs ever see the
mutex lock location in the locked state. It solves the contention storm,
and yet it's "optional" meaning if hardware has to "give up", it's still correct (just lower performance). And that lock-free algorithms would
get the same benefit IF IT CAN BE DETECTED, but that it's much harder
for HW to detect (since the lock-free set of operations is potentially
much larger and more complex).

And that with HW support to solve the contention storm, a mutex (and
a spinlock, which is inside a mutex), which is easy for software to use,
could become faster than lock-free algorithms (when spinlocks or mutexes
make sense, since I'm abandoning the idea of making them work in interrupt handlers, etc.).

For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts, so I suggest making masking interrupts an extremely lightweight operation. When you see an
instruction which seems like it would mask interrupts in Fetch/Decode, immediately set a bit to block interrupts. Once the instruction
executes (or is discarded), clear that bit (maybe make it a 4-bit
counter, so you can speculate further). Don't serialize the instruction
that masks off interrupts.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri Jun 12 17:11:29 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) posted:

ERROR "unexpected byte sequence starting at index 1569: '\xC3'" while decoding:

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

On 6/10/2026 7:58 PM, Kent Dickey wrote:

In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:

[snip]

Lock-free also solves the thread preemption issue--a thread that wins a mutex
and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave
some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.

The LL/SC sequence on My 66000 implementations is guaranteed to fail
when an interrupt (or exception) causes a transfer of control out of
the thread, not while running the critical section, but while acquiring
or releasing the mutex.

LOCK XADD is a single atomic instruction, one hardware transaction, no >> >kernel involvement, no thread blocking, no scheduler interaction. Done. >> >
A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse Ã¢ÂÂ
threads block, context switches happen, the OS scheduler gets involved. >> >
So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation it's what guarantees and costs come with the

primitive. LOCK

XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties >> >regardless of how fast it is.

You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.

Do you agree that an ISA cannot make an atomic-acquire or atomic-release >any faster than a core can perform the same instruction sequence without >any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).

The obvious answer is no, and that is true for an uncontended atomic instruction. However, once contention comes in to play, anything that reduces the contention storm can be a big win.

The main motivation for allowing a single ATOMIC-event to perform data manipulation on more than one memory location is to reduce the total
number of ATOMIC-events--and thereby reducing the amount of contention.

It's not clear how to
reduce that storm when it's caused by ordinary LD/ST, but I see lots of
ways to reduce it much of the time if HW can detect the operation's boundaries.

When I designed ASF at AMD (2005), it was clear that one of the problems
of "regular" synchronization was that the release* of the mutex was not
marked in some way as ATOMIC--it was a ST #0 (or equivalent). SC converts
that ST into something that is easily recognized as ATOMIC.

(*) test_and_set was::

TS Rd,[address]
BNE0 label
...
... // critical section work
...
ST #0,[address] // takes address comparator to see
// this is part of an ATOMIC-event

I didn't describe this in that post, but if you add in a CPU >> which supports sequential ordering with the mutex Transaction Prediction >> logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.

The LL/SC semantics on My 66000 directly supports using as many as
8 cache lines in a single ATOMIC event with the property that all
3rd party observers only see the before and after states of all
cache lines. From the beginning of the first LL to the end with an
SC, all such participating memory modifications appear atomic.

Thus, there was no need for the concept of nesting nor serialization.
One does not have to get 1 mutex and then get another--you can get both
at the same time (should you need that).

This brings up the notion of the difference between an ATOMIC-event
and a critical section. An ATOMIC-event is a unit of work that all >interested 3rd parties can all agree on "order" {Joe got it Sally
did not}. All of this can be handled by addresses, without consideration
of any data value (address values yes, data values, no). Acquire-mutex
and release-mutex are ATOMIC based on address. While the rest of the >critical section is primarily based on data (and pointers used as data).

Transactional Memory from 15 years ago was going to fundamentally change
the way multi-threaded apps worked. And then it fizzled out. Because
the goal was laudable: remove locks and atomic altogether, and make
adding a record to a database be just:

START_TXN();
add_record_to_database();
END_TXN();

and just do add_record_to_database() effectively atomically. The problem
is that edge cases mean sometimes add_record_to_database() can be 5-10 instructions, and sometimes it's thousands. And Transactional Memory
is very complex for hardware to get right, and it's hard for software to
get right too (since software cannot let there be thousands of instructions inside the TXN, so it needs to be rewritten).

The TM model was fundamentally flawed--SW was assuming the amount of data manipulation in a single transaction was unbounded:: HW does not do unbounded well.

Basically, there are too many unsolved complex problems. Some of which
apply to your ESM--needing software changes is a big problem. And so far
no one has shown how to exhaustively validate this kind of mechanism.
Because there's no real fallback: it has to work, it's not a feature that
can just be "turned off".

All of the now current atomic::stuff can be directly implemented in ESM.
It has to be that way for any chance of success !! However, in order for
SW to address the exponent of interference, the std::atomic::stuff needs
some additional functionality. ESM has the ability to allow all the
requisite SW work/experimentation to be performed without HW additions
(as long as the ATOMIC-event does not need more than 8 cache lines of
data to be manipulated in a single event.)

There are also contexts where you literally cannot use a mutex Ã¢ÂÂ signal
handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice.

If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.

It also requires a control transfer point so that if an interrupt is
not able to be deferred, that the mutex guarded critical section has
a way to KNOW that the un-interruptible-event was, indeed, interrupted.
My 66000 has this built into its pipelined version of LL/SC.

A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW
implementation.

I like to say, My 66000 ATOMICs are not a form of TM but can be used to >make a SW version of TM have the appropriate semantics guarantees.

I've had more time to think about my vague description on how to make an non-interruptible mutex, and now I think it's not a good idea. I think trying to hold off actual interrupts is still a good idea, but not to have
HW try to back out of the mutex operation itself.

Agreed, a mutex is a SW model, while an ATOMIC-event is a HW model.

It's just too much like Transactional Memory--rather than being an optional acceleration which can
be turned off a problem is found, non-interruptible mutex must always be correct and must provide guarantees, and so it's wandered into the minefield of Transactional Memory, and it's best to stay out of there.

I saw this in 2004...

Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free >> algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.

What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.

From the HW point of view (and the programming model of My 66000),
There are 3 noted events:
a) acquisition of the mutex(s)
b) transit of the critical section
c) release of the mutex(s).

The problem with old style ATOMICs is that while (a) is easily visible
as ATOMIC, (c) is not (although a predictor would likely work quite
well.

Yes, I agree. But note the release of the mutex is to the same address
as the atomic instruction, so it can be detected after it's been seen.

Way back when ASF was being born; another address comparator might
have reasonable arguments against it. Now not so much. In any event,
one does not want this address comparator to add to thread.state and
impact ISR times; that is why ASF and ESM adopted a model whereby
any ATOMIC-event was guaranteed to fail if an exception/interrupt
transpired. This also gets rid of excessive thread.state at context
switch time.

The problem with CAS style ATOMICs is that the initial LD(s) are not
seen as ATOMIC until the CAS is performed, opening a window for ABA.

Agreed, and it's hard to pick out the preparation work being done for
these lock-free algorithms.

The problem with LL/SC style ATOMICs is that while the whole sequence
is known to be a critical section, spurious failures are allowed in
most implementations.

I am not a fan of LL/SC. It's possible to get it right in HW, but most designers underestimate the invasiveness of LL/SC--the entire load/store
unit needs little modifications everywhere to make LL/SC work efficiently
and reliably. You want to implement transient memory or writes which
bypass the cache, or you speculatively execute instructions past the
end of the LL/SC loop--make sure you solve how they interact with LL/SC.
I've yet to meet any design which got it "right" the first try (but once
a design team learns, then they get it right, until another new system feature pops up and breaks it).

LL/SC is the academic solution to building "let's allow anything" atomics.

The academic ?solutions? fail to have the property that some kind of
automatic flow-control is performed when it becomes known that the
event will fail. ASF had a primitive version, ESM was a well worked out version.

If one simply codes ESM to perform Test_and_set, HW conspires to actually implement Test_and_test_and_set--due to the way the above paragraph is performed in HW.

"We can let the user build the atomic operations are their dreams, and
we don't have to solidify it in HW, we just let them write code."

This is ripe for undefined HW behavior--or worse, slightly varying
behavior as time goes on.

It's like a preliminary Transactional Memory--we can let the users do a
lot with us just needing to do a little bit in hardware.

See how well that worked out ?!? not ...

And then when LL/SC gets implemented, you basically end up with Fetch-and-Add, Swap, and Compare-and-Swap (and other trivial ops, like Fetch-and-AND, etc.) only. The RISC-V restrictions are pretty intense:
You can do LR; you can then do a handful of integer instructions (no FP,
no vector) not including any ld/st or system instructions, nor any
backward branches, no function calls; and then SC, then you get the same restrictions but now you get a backwards branch.

My 66000 has none of these restrictions, BTW. The only caveat is that
as the ATOMIC-event grows larger, the window for interference also
grows larger and failures happen more often.

What exactly are you
able to implement with these restrictions other than the usual atomic operations that now most everyone puts in HW?

DCADS, TCADS, TCATS, ...

In additions, one can "do the ATOMIC-thing and also store a marker
so debugger can figure out WHO did this to the concurrent data
structure; One can add a time figure to the CDS, ... there are
a lot of things one can do to make these things more debuggable,
robust, {RAS arguments} by increasing the amount of data that
can be performed in a single ATOMIC event.

And then there is the reduction in the number of ATOMIC-events,
which directly addresses interference.

Note on RISC-V, if an implementation has LR/SC, then it must ALSO have the atomic instructions Fetch-and-Add, Swap, Compare-and-Swap, and the other simple ones. So why bother with LR/SC?

This is what happens when one approaches ATOMIC-stuff with an academic mentality.

I understand where RISC-V's LL/SC restrictions are coming from. I get it. Which is why I don't like LL/SC.

In My 66000, when/if control arrives at the failure-point, one can access
a control register that contains WHY the event failed. ) means success, negative means spurious, while positive is a direct measure of the amount
of interference you encountered. The last one can be used in a subsequent ATOMIC-event proactively to go after a unit of work that no other thread
is going after, and thereby eliminate interference. It is just that SW has
to be "in that game", HW cannot do it by itself. No SW is currently written like this because no HW provides that kind of functionality.

Chicken and Egg.

It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.

Lock-free algorithms make it much harder for hardware to see what is the >> real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new >> item's next-> to point to that, then doing Compare-and-Swap on the head
pointer with our item pointer.

Try that on a doubly linked list. That is where DCADS comes in.

When I say Compare-and-Swap, I implicitly include DCAS. DCADS is more exotic, so if I mean that, I'd explicitly list it.

It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic, >> but this is just a trivial lock-free case).

Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??

I'm just not sure that provides much help.

To me, the key insight which I wanted to get across was that with a mutex (and spinlock), it's a huge advantage to NOT let other CPUs ever see the mutex lock location in the locked state. It solves the contention storm,
and yet it's "optional" meaning if hardware has to "give up", it's still correct (just lower performance).

Consider that thread.A gets the mutex, then a high priority interrupt
needs and ISR, and a context switch to a higher priority task. There
is NO WAY for HW not to let everyone see the mutex in its locked state.
The problem is that the mutex gets is state by having a datum modified !

Whereas: in HW one can perform the near-mutex based solely on the address without any data being modified--since the address is subject to SNOOPs.
This puts HW in a position where it can do a control-transfer before
taking an exception/interrupt--thereby backing out of the near-mutex event. This is a key difference between MY 66000 ESM (and AMD ASF) and academic
LL/SC. The amount of ATOMICally-modifyable data is another.

And that lock-free algorithms would
get the same benefit IF IT CAN BE DETECTED, but that it's much harder
for HW to detect (since the lock-free set of operations is potentially
much larger and more complex).

And that with HW support to solve the contention storm, a mutex (and
a spinlock, which is inside a mutex), which is easy for software to use, could become faster than lock-free algorithms (when spinlocks or mutexes
make sense, since I'm abandoning the idea of making them work in interrupt handlers, etc.).

2 Points::

1) ESM can be used in ISRs.
2) My 66000 has a very fast way to perform DPC/softIRQs based on its
interrupt table model obviating the need for ISRs to do ATOMIC-stuff.

For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,

Or transition to a priority level that no interrupt can raise.

so I suggest making masking interrupts an extremely lightweight operation. When you see an
instruction which seems like it would mask interrupts in Fetch/Decode, immediately set a bit to block interrupts. Once the instruction
executes (or is discarded), clear that bit (maybe make it a 4-bit
counter, so you can speculate further). Don't serialize the instruction
that masks off interrupts.

Most of what you say above is unnecessary in My 66000 ISRs.
a) the priority of the core is automagically updated with the priority
of the interrupt.
b) the thread.state of the interrupted thread is returned to memory
c) the thread.state of the interrupting thread is fetched from memory
d) the fetch of thread.state can start before the store of thread.state
e) MSI-X interrupt message is already in R0.

So, by the time control arrives, ISR has a complete thread.state, including
the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and near-40-bits from the interrupt so it can figure out what to do based on control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single instruction). When ISR performs SVR, SVR check pending threads in interrupt table, and transfers control to any pending thread higher than interrupted thread (SW does not have to check).

Interrupts do not need to be masked, as all ISRs of lesser priority have already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled
with no compiler-knowledge of it being an ISR (and no ASM instructions).

Kent

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Sun Jun 14 18:57:47 2026

From Newsgroup: comp.arch

MitchAlsup <[email protected]d> posted:

Consider "Dining Philosophers" My 66000 style::

struct { uint64_t left, right,
time, wait, starve;
} philosopher;

Task Philosopher()
{
Local philosopher *p;

do{
if( get_forks( p )
{
eat();
put_forks( p );
}
else
{
philosophise( p ); // wait p->time
p->wait += p->time;
if( p->wait > p->starve );
dies( p ); // no return from here
}
} while( TRUE );
}

Where::

get_fork:
LDDL R3,[R1,#left] // start of event
LDDL R4,[R1,#right]
OR R5,R4,R3
BEQ R5,exit
STD #1,[R1,#left]
STDL #1,[R1,#right] // end of event
MOV R1,#1
RET
exit:
INVAL #EVENT // terminate event without writing anything
MOV R1,#0
RET

put_forks:
STD #0,[R1,#left]
STD #0,[R1,#right]
STD #0,[R1,#wait]
RET
--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Sun Jun 14 13:45:05 2026

From Newsgroup: comp.arch

On 6/11/2026 7:20 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/9/2026 6:33 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

-------------------------

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state >>>> machines and such.

[...]

The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.

If you want the strong CAS it is written::

LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

If you want weak CAS it is written:

LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

It just changes the length of time interference is being observed.

Pretty interesting! So, we can have spurious failure on the weak CAS and
none on the strong CAS?

It is not zero, but it is exponentially fewer; and no architecture is
going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

Just to clarify, strong CAS should never spuriously fail. Right?

strong CAS failure can be used in a state machine, weak CAS failure
cannot...

--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Sun Jun 14 21:14:03 2026

From Newsgroup: comp.arch

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

[snip]

I'm just going to follow up on Linux kernel spinlocks.

For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,

Or transition to a priority level that no interrupt can raise.

so I suggest making masking
interrupts an extremely lightweight operation. When you see an
instruction which seems like it would mask interrupts in Fetch/Decode,
immediately set a bit to block interrupts. Once the instruction
executes (or is discarded), clear that bit (maybe make it a 4-bit
counter, so you can speculate further). Don't serialize the instruction
that masks off interrupts.

Most of what you say above is unnecessary in My 66000 ISRs.
a) the priority of the core is automagically updated with the priority
of the interrupt.
b) the thread.state of the interrupted thread is returned to memory
c) the thread.state of the interrupting thread is fetched from memory
d) the fetch of thread.state can start before the store of thread.state
e) MSI-X interrupt message is already in R0.

The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled. And the kernel has spinlocks it needs to
win. But if the kernel thread gets interrupted with a higher priority task once it gets the spinlock, but before it finishes with it, this is bad since some other thread now has to wait a long time for the spinlock to free up.
It also might cause deadlocks.

So the Linux macro is spinlock_irqsave(), which first masks interrupts,
then wins the spinlock (spinning until it wins). Then do the operation
needed, then do spinunlock_irqrestore() to release the spinlock and
allow interrupts again.

The basic idea is this should be fast. We DO NOT want a higher priority interrupt to occur while we own the spinlock. So we just need to
temporarily mask interrupts. And I'm suggesting this masking and unmasking
of interrupts should be very lightweight and fast.

I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe they've moved all the important stuff to not use spinlock_irqsave() anymore,
I don't know. It still shows up a lot in the Linux source though in various drivers at least.

So, by the time control arrives, ISR has a complete thread.state, including >the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >near-40-bits from the interrupt so it can figure out what to do based on >control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >instruction). When ISR performs SVR, SVR check pending threads in interrupt >table, and transfers control to any pending thread higher than interrupted >thread (SW does not have to check).

Interrupts do not need to be masked, as all ISRs of lesser priority have >already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled
with no compiler-knowledge of it being an ISR (and no ASM instructions).

No, during the kernel spinlock_irqsave(), you have to mask out all
interrupts, including higher-priority ones.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@[email protected] (Scott Lurndal) to comp.arch on Sun Jun 14 21:46:19 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) writes:

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

[snip]

I'm just going to follow up on Linux kernel spinlocks.

For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,

Or transition to a priority level that no interrupt can raise.

The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled. And the kernel has spinlocks it needs to >win. But if the kernel thread gets interrupted with a higher priority task >once it gets the spinlock, but before it finishes with it, this is bad since >some other thread now has to wait a long time for the spinlock to free up.
It also might cause deadlocks.

So the Linux macro is spinlock_irqsave(), which first masks interrupts,
then wins the spinlock (spinning until it wins). Then do the operation >needed, then do spinunlock_irqrestore() to release the spinlock and
allow interrupts again.

The basic idea is this should be fast. We DO NOT want a higher priority >interrupt to occur while we own the spinlock. So we just need to
temporarily mask interrupts. And I'm suggesting this masking and unmasking >of interrupts should be very lightweight and fast.

I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >they've moved all the important stuff to not use spinlock_irqsave() anymore, >I don't know. It still shows up a lot in the Linux source though in various >drivers at least.

So, by the time control arrives, ISR has a complete thread.state, including >>the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >>near-40-bits from the interrupt so it can figure out what to do based on >>control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >>instruction). When ISR performs SVR, SVR check pending threads in interrupt >>table, and transfers control to any pending thread higher than interrupted >>thread (SW does not have to check).

Interrupts do not need to be masked, as all ISRs of lesser priority have >>already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled >>with no compiler-knowledge of it being an ISR (and no ASM instructions).

No, during the kernel spinlock_irqsave(), you have to mask out all >interrupts, including higher-priority ones.

While true, spinlock_irqsave() will mask interrupts on the current core/hardware thread; Other cores/threads will continue running and
taking interrupts that are either 1-of-N interrupts or specifically
assigned to a core.

In any operating system where nested interrupts on a core are supported, the interrupt handling code must be very carefully coded to ensure that
the handler of a lower priority interrupt does not obtain resources
(e.g. locks) that the higher priority interrupt will require to complete processing the interrupt (classic deadlock). There are other issues
related to kernel thread stack sizes that also pop up with nested
interrupts.

Linux doesn't support nested interrupts on a core, rather it
defers the interrupt hanndling work to a kernel thread and returns
immediately from the interrupt, unless handling the interrupt is trivial.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Andy Valencia@[email protected] to comp.arch on Sun Jun 14 20:18:13 2026

From Newsgroup: comp.arch

[email protected] (Kent Dickey) writes:

The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled.

Most of the spinlocks I've lived with were in a system with a notion
of a prioritized hierarchy of interrupt levels. Taking a lock was
an atomic operation; when your call returned, the lock was held _and_
no interrupt at or below the specified level could happen. There was
never an assumption that you could spin on a lock because some other
CPU would eventually release it; the SMP kernel could run with a single processor.

This was all in a BSD-ish world; SVR3 STREAMS added an entire parallel ecosystem of SMP issues.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Mon Jun 15 17:55:57 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <[email protected]> posted:

On 6/11/2026 7:20 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/9/2026 6:33 PM, MitchAlsup wrote:

"Chris M. Thomasson" <[email protected]> posted:

On 6/7/2026 11:34 AM, MitchAlsup wrote:

-------------------------

Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong >>>> does not allow that to occur. So, it can be used for interesting state >>>> machines and such.

[...]

The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.

If you want the strong CAS it is written::

LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

If you want weak CAS it is written:

LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]

It just changes the length of time interference is being observed.

Pretty interesting! So, we can have spurious failure on the weak CAS and >> none on the strong CAS?

It is not zero, but it is exponentially fewer; and no architecture is
going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).

Just to clarify, strong CAS should never spuriously fail. Right?

Yes.

strong CAS failure can be used in a state machine, weak CAS failure
cannot...

--- Synchronet 3.22a-Linux NewsLink 1.2

From Chris M. Thomasson@[email protected] to comp.arch on Tue Jun 16 12:19:07 2026

From Newsgroup: comp.arch

On 6/14/2026 2:46 PM, Scott Lurndal wrote:

[email protected] (Kent Dickey) writes:

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

[email protected] (Kent Dickey) posted:

[snip]

I'm just going to follow up on Linux kernel spinlocks.

For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,

Or transition to a priority level that no interrupt can raise.

The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled. And the kernel has spinlocks it needs to >> win. But if the kernel thread gets interrupted with a higher priority task >> once it gets the spinlock, but before it finishes with it, this is bad since >> some other thread now has to wait a long time for the spinlock to free up. >> It also might cause deadlocks.

So the Linux macro is spinlock_irqsave(), which first masks interrupts,
then wins the spinlock (spinning until it wins). Then do the operation
needed, then do spinunlock_irqrestore() to release the spinlock and
allow interrupts again.

The basic idea is this should be fast. We DO NOT want a higher priority
interrupt to occur while we own the spinlock. So we just need to
temporarily mask interrupts. And I'm suggesting this masking and unmasking >> of interrupts should be very lightweight and fast.

I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >> they've moved all the important stuff to not use spinlock_irqsave() anymore, >> I don't know. It still shows up a lot in the Linux source though in various >> drivers at least.

So, by the time control arrives, ISR has a complete thread.state, including >>> the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and
near-40-bits from the interrupt so it can figure out what to do based on >>> control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single
instruction). When ISR performs SVR, SVR check pending threads in interrupt >>> table, and transfers control to any pending thread higher than interrupted >>> thread (SW does not have to check).

Interrupts do not need to be masked, as all ISRs of lesser priority have >>> already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled
with no compiler-knowledge of it being an ISR (and no ASM instructions).

No, during the kernel spinlock_irqsave(), you have to mask out all
interrupts, including higher-priority ones.

While true, spinlock_irqsave() will mask interrupts on the current core/hardware thread; Other cores/threads will continue running and
taking interrupts that are either 1-of-N interrupts or specifically
assigned to a core.

In any operating system where nested interrupts on a core are supported, the interrupt handling code must be very carefully coded to ensure that
the handler of a lower priority interrupt does not obtain resources
(e.g. locks) that the higher priority interrupt will require to complete processing the interrupt (classic deadlock). There are other issues
related to kernel thread stack sizes that also pop up with nested
interrupts.

Linux doesn't support nested interrupts on a core, rather it
defers the interrupt hanndling work to a kernel thread and returns immediately from the interrupt, unless handling the interrupt is trivial.

Lock/wait free in the interrupt handlers are usually fine. Makes things
oh so much better. Actually, iirc is sem_post async safe? I think it is.
--- Synchronet 3.22a-Linux NewsLink 1.2

Who's Online

System Info

Sysop:	DaiTengu
Location:	Appleton, WI
Users:	1,123
Nodes:	10 (0 / 10)
Uptime:	32:25:32
Calls:	14,371
Files:	186,380
D/L today:	570 files (158M bytes)
Messages:	2,540,570

Synchronization

Who's Online

System Info