This is mainly for Chris M. Thomasson but everyone is free to pile on::
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties
see that the unit of work was either never started or finished completely without any intermediate state being observed by any party other than the core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} fall under this definition.
define: memory-atomic as a unit of work identified in a core and packaged
up to be performed by the cache hierarchy or PCIe device control registers. Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- of-work are performed in the cache having write-permission {EXCLUSIVE or MODIFIED cache line states}. When unCacheable, these units of work are performed directly on the addressed control register. In both cases, any interested party sees only the previous state or the after state.
Core-atomic events have the cache hierarchy performing a number (often quadratic) cache line transfers to satisfy a single event ({failing
(n-1) others when under contention}.
Memory-atomic events cannot escape the interconnect latency overhead
but move no data around that does not have to be moved.
Question: If you were starting lout with a clean slate <as it were>,
and not knowing if the atomic-shared-library is to be used in places
where core count was unknown AND interference level was unknown::
Would you implement the atomic-shared-library using core-atomic inst- ructions or would you implement the library using memory-atomic inst- ructions ?? Or would you implement some of the library as core-atomic
and other parts as memory-atomic?? and why!
Auxiliary question: Given the definition <above> of memory-atomic::
are not all to-memory packaged events not ATOMIC in the Lamport sense
as long as the package leaves the core sequentially consistent ??
On 5/27/2026 2:44 PM, MitchAlsup wrote:[snip atomic op context]
Question: If you were starting lout with a clean slate <as it
were>,
and not knowing if the atomic-shared-library is to be used in
places
where core count was unknown AND interference level was unknown::
Would you implement the atomic-shared-library using
core-atomic inst-
ructions or would you implement the library using
memory-atomic inst-
ructions ?? Or would you implement some of the library as
core-atomic
and other parts as memory-atomic?? and why!
Auxiliary question: Given the definition <above> of
memory-atomic::
are not all to-memory packaged events not ATOMIC in the
Lamport sense
as long as the package leaves the core sequentially consistent ??
Both, but the choice is almost "secondary" in a sense as to
whether the programmer has done their job first... Also, well, I
am quite fond of relaxed systems. Shit. Perhaps not at relaxed
as an alpha, wrt data-dependent loads.
Before you even pick the instruction class, the programmer must
eliminate false sharing. If the target is an LL/SC architecture,
they really _need_ to know the reservation granule size and
pad/align accordingly. For RMW like LOCK XADD, never straddle a
l2 cache line boundary — pad and align to it. These aren't
optional courtesies, they're prerequisites...?
Actually, an arch that has both LL/SC and atomic RMW, well,
that's pretty nice. I think AMD has that? Give the programmer
the ability to choose.
Beyond that, the deeper answer is: don't hammer a single
location at all! damn it. A single lock-free stack under
contention is a problem regardless of whether you're using
core-atomic or memory-atomic instructions.
A hash table of
lock/wait-free stacks/queues distributes that pressure. There
are single-producer/single consumer (SPSC),
multi-producer-single consumer (MPSC), all the way to (MPMC)...
Each algo is optimized for its usage pattern. Design for clever
distribution first, then pick your atomics. Actually if we map
the per threads to the hyper threads, cores, cpu's and numa
nodes in a fractal way, then we can scale fairly well... Gain
the topology and localize it in a sense...
For something like a shared counter, I'd go thread-local first,
aggregated per-CPU, then per-NUMA node, etc. Keep sharing as
local as possible — affinity masks in user space to map the
topology, and if you must share outside per-thread variables,
share local, as local as possible...
On 5/28/26 3:37 PM, Chris M. Thomasson wrote:
On 5/27/2026 2:44 PM, MitchAlsup wrote:[snip atomic op context]
Question: If you were starting lout with a clean slate <as it were>,
and not knowing if the atomic-shared-library is to be used in places
where core count was unknown AND interference level was unknown::
Would you implement the atomic-shared-library using core-atomic inst-
ructions or would you implement the library using memory-atomic inst-
ructions ?? Or would you implement some of the library as core-atomic
and other parts as memory-atomic?? and why!
Auxiliary question: Given the definition <above> of memory-atomic::
are not all to-memory packaged events not ATOMIC in the Lamport sense
as long as the package leaves the core sequentially consistent ??
Both, but the choice is almost "secondary" in a sense as to whether
the programmer has done their job first... Also, well, I am quite fond
of relaxed systems. Shit. Perhaps not at relaxed as an alpha, wrt
data-dependent loads.
Before you even pick the instruction class, the programmer must
eliminate false sharing. If the target is an LL/SC architecture, they
really _need_ to know the reservation granule size and pad/align
accordingly. For RMW like LOCK XADD, never straddle a l2 cache line
boundary — pad and align to it. These aren't optional courtesies,
they're prerequisites...?
If the operation is performed "centrally" (owning cache, I/O
device), then cache block size may not be important: the cache
can avoid performance problems associated with false sharing
(the singular cache/device acts as a single agent performing
all the operations).
While it would not add significant latency (or significantly
reduce throughput) — I think —, it _seems_ that it would also be
possible for different levels of cache to have different
granularities with respect to exported atomic operations. E.g.,
a cluster of cores sharing L2 cache might use a different
granule for exported atomic operations than L3 (shared by
multiple clusters). A smaller L3 granule would seem to allow for
high contention among many distributed threads with the
assumption that contention within a cluster is necessarily
limited; this might be more a concern when L3 slices are
address-distributed rather than having multiple L3s.
SRAM array width seems a natural granule for optimization as
that is the physical read and write size (typically).
On the other hand, there seem to be other local cache conflict
factors (like bank conflicts) unrelated to "reservation
granule" that such is really probably not a consideration for
software optimization.
One might also have cases where an exported (word granule)
atomic operation is in conflict with a thread-performed (cache
block granule) atomic operation. If the cache block is owned
outside of the thread locally performing an atomic operation,
the exported atomic might be granted priority if the ownership
transfer was still queued, but there might be cache block
ping-pong with thread-performing atomics repeatedly taking
ownership. This implies that consistency with the type of
atomic operation could be important.
For addition, the operations could be coalesced as they pass
toward the owner in the network (and "split" as the result is
returned to individual agents/threads). Such might only make
sense for fire-and-forget operations (like software performance
counters or increasing reference counts) if the centralized
buffering is sufficient. Distributing the operations might
reduce network bandwidth by splitting a message into multiple
messages closer to their return destinations rather than having
multiple messages sent back from the central source, but having
multiple operations on the same data within the queuing delay
seems an uncommon case.
Decreasing reference counts are interesting in that the
interesting case is when the count becomes zero and a resource
freeing operation may be performed. Generally(?) this side
effect would not be synchronous with the thread changing the
reference count, so there could be some sense in allowing such
special cases to be caught outside the thread like a I/O event.
Of course, if resources were being released because they were
scarce and wanted by the releasing thread, then there might be
a latency penalty associated with sending the freeing work to
another hardware context.
Even with ordinary stores that might be read by other threads
(or later in a threads execution under out-of-order execution),
there seems to be some diversity of temporal distance in access
which might be exploited.
There might also be "just-a-peek" memory reads that are used for
general decision making (like reading the consumer pointer of a
buffer to measure roughly how full the buffer is and whether
work priorities should be changed). While it might not be
practical to allow these to have inconsistent ordering (in terms
of software and hardware complexity), there may be optimization
opportunities in exploiting such memory accesses. (Such might
also be useful for value prediction if coherence/ordering
correctness introduces excessive delay, getting a speculative
value earlier may sometimes be useful.)
I remember a paper suggesting assigning limited lifetimes to
cache block ownership to reduce coherence traffic; once the loan
period ended, the block would no longer be held. A "peek" might
be considered an infinitesimal length loan.
Actually, an arch that has both LL/SC and atomic RMW, well, that's
pretty nice. I think AMD has that? Give the programmer the ability to
choose.
Beyond that, the deeper answer is: don't hammer a single location at
all! damn it. A single lock-free stack under contention is a problem
regardless of whether you're using core-atomic or memory-atomic
instructions.
I would guess that there are software development tradeoffs in
how much effort is made to avoid conflicts (and promote
scalability). Sometimes having a global lock may make sense.
A hash table of lock/wait-free stacks/queues distributes that
pressure. There are single-producer/single consumer (SPSC), multi-
producer-single consumer (MPSC), all the way to (MPMC)... Each algo is
optimized for its usage pattern. Design for clever distribution first,
then pick your atomics. Actually if we map the per threads to the
hyper threads, cores, cpu's and numa nodes in a fractal way, then we
can scale fairly well... Gain the topology and localize it in a sense...
For something like a shared counter, I'd go thread-local first,
aggregated per-CPU, then per-NUMA node, etc. Keep sharing as local as
possible — affinity masks in user space to map the topology, and if
you must share outside per-thread variables, share local, as local as
possible...
Yet the "topology" changes if the atomic operations are
exported. If all shared counter operations are exported for a
particular address and the hardware uses aggressive
optimizations, then sharing a counter may be sufficiently
scalable with benefits in software simplicity (move complexity
to hardware☺) or other areas.
[snip remaining interesting content]
On 5/31/2026 5:57 PM, Paul Clayton wrote:----------------
On 5/28/26 3:37 PM, Chris M. Thomasson wrote:
On 5/27/2026 2:44 PM, MitchAlsup wrote:
[snip remaining interesting content]
The topology needs to be available to the programmer, full stop in a sense... I get the appeal of "let the hardware handle it" but that's not good enough? If I can't see the topology, I can't make "informed
decisions" about "data placement", affinity, and locality...
Hardware optimizations are a nice bonus on top of good software design. Shit. Howerver, they're not a substitute for it? And if I can't reason about
the shape of the system, I can't even make an informed decision about
when a global lock is acceptable versus when I need a distributed
structure.
That tradeoff requires knowledge, not abstraction.
Also, things do NOT need to be all seq_cst. Shit. My SPARC/RMO
experience is directly relevant here. All the atomic RMWs are relaxed,
and you use MEMBAR to express exactly the ordering you need and nothing more. Avoid #StoreLoad at all costs — only pay for it when the algorithm genuinely demands it. That discipline forces you to think carefully
about what guarantees you actually need between specific threads, which
is a feature not a bug.
And here's where the two points connect: if you understand the topology,
you know which threads share which data, which tells you what ordering guarantees you actually need between them.
Abstract away the topology
and you're pushed toward conservative, expensive, blanket seq_cst
everywhere because you can't reason about anything more precise. Expose
the topology and you can be surgical about it.
Relaxed where you can. Stronger ordering only where you must. And always knowing the shape of the system you're running on. ;^)
Also, for x86 yes we need to pad and align for the l2 cache line anyway.--- Synchronet 3.22a-Linux NewsLink 1.2
So shit happens?
"Chris M. Thomasson" <[email protected]> posted:[...]
Consider the galaxy scale system mentioned above. What kind of topology
will be adequate for that kind of scale ??
Also, for x86 yes we need to pad and align for the l2 cache line anyway.
So shit happens?
On 6/1/2026 6:18 PM, MitchAlsup wrote:
[...]
"Chris M. Thomasson" <[email protected]> posted:
Consider the galaxy scale system mentioned above. What kind of topology
will be adequate for that kind of scale ??
Sorry for the quick response, I am really busy right now. Will get back
to you with more detail sometime tonight.
Galaxy level? The super massive black hole. Its smaller black holes wandering the dust lanes? The stars, the solar systems, the planets in orbit. Down to the per-thread level...
Say, for starters:
per-thread
per-cpu
per-numa
finally go global only if we have to!
Also, try to avoid sharing when possible.
As for programmers that know this shit, well, they are few and far
between. But, that does not mean, to borrow a phrase from Joe Seigh...
When we get hardware we make software.
When we get lemons we make lemonade.
I love RMO! The Alpha is too strict, but shit happens. Alpha is too
strict in the sense that it does not honor a load dependency. RCU guys
know all about it... shit happens...
Also, for x86 yes we need to pad and align for the l2 cache line anyway. >>> So shit happens?
This is mainly for Chris M. Thomasson but everyone is free to pile on::
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties
see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.
define: memory-atomic as a unit of work identified in a core and packaged
up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
This is mainly for Chris M. Thomasson but everyone is free to pile on::
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >see that the unit of work was either never started or finished completely >without any intermediate state being observed by any party other than the >core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >fall under this definition.
define: memory-atomic as a unit of work identified in a core and packaged >up to be performed by the cache hierarchy or PCIe device control registers. >Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP, >CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >of-work are performed in the cache having write-permission {EXCLUSIVE or >MODIFIED cache line states}. When unCacheable, these units of work are >performed directly on the addressed control register. In both cases, any >interested party sees only the previous state or the after state.
I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.
The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.
I feel the best rule for atomic design (for hardware and software) is to always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.
The problem with memory-atomic/Far-atomic is if the lock is on the same cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.
To do the Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.
There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the platform-specific way to do an atomic transaction (and with all the restrictions), this is high overhead for software, and does NOT match how most software is written.
But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that cacheline Exclusive and do the atomic operation, a Tentative Transaction starts. You block responding to snoops until we decide we have to. We
can do misses and whatnot, but after say 500 clocks, we allow snoops
again (to avoid deadlocks). The transaction ends when the atomic location
is written again.
What this achieves is an end to thrashing. When it's working, the transaction-prediction hardware lets the thread get the lock, and do its work, and release the lock, in one go, not responding to outside snoops.
When it does respond to the snoops, the first snooper immediately wins the lock and gets its work done, and so on. For this to work easily, the software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.
Add a Transaction-prediction structure, to mark which atomics seem to be
the start of a transaction, and which ones are not, and roughly what that transaction looks like. And then you've solved the cache thrashing of locks--after seeing an atomic a few times, you learn what it's doing, and
can try to make it succeed each time it executes. And how long each transaction seems to be, the raise the limit for blocking snoops to
allow it to be more likely to succeed.
To handle the load peeking at a lock to see if its free requires
the Transaction-prediction hardware to detect this load and effectively
turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
and it becomes the start of the transaction.
This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing. If the software algorithm is poor,
where there's a big lock protecting a lot of work, then it exceeds the hardware's ability to do efficiently, and the hardware instead works like other platforms, and there can be thrashing.
Kent
In article <[email protected]>
On 6/7/2026 8:23 AM, Kent Dickey wrote:
In article <[email protected]>
Sorry for the big snip...
[...]
Fwiw, from my personal experience, two locks need to be padded and
aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
the reservation granule. Get the granule size wrong on LL/SC and your SC
can cause some interesting harm (a lot of sc failures, etc.) to another >core's reservation even when the locks have no logical relationship to
each other...
So if your algo is using LL/SC you need to worry about the reservation >granule.
If its using something akin to LOCK RMW, well, the l2 cache line is your >friend.
[email protected] (Kent Dickey) posted:[snip]
Thank you Kent, much valuable information below.
I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.
LL/SCs are designed to take no more than LD/ST pipeline time outside of >contention.
The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.
What do you do when you have more than 1 cache line--and all lines need
to be in an exclusive state prior to modifying concurrent data ??
I feel the best rule for atomic design (for hardware and software) is to
always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.
OK.
The problem with memory-atomic/Far-atomic is if the lock is on the same
cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.
The whole reason for {memory,far}-atomic is to avoid moving data around
the interconnect inorder to perform an event/transaction.
To do the
Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.
Fair point-but Chris M. Thomasson makes the point that the locks should not >be placed together--your position is that this happens "more or less" all
the time. Hmmmmm
There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the
platform-specific way to do an atomic transaction (and with all the
restrictions), this is high overhead for software, and does NOT match how
most software is written.
But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that
cacheline Exclusive and do the atomic operation, a Tentative Transaction
starts. You block responding to snoops until we decide we have to. We
In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention >because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become >Exclusive.
can do misses and whatnot, but after say 500 clocks, we allow snoops
again (to avoid deadlocks). The transaction ends when the atomic location >> is written again.
In a system with multiple-chips each containing dozens of cores, 500
cycles might not be enough to even send a CI around the interconnect.
What this achieves is an end to thrashing. When it's working, the
transaction-prediction hardware lets the thread get the lock, and do its
work, and release the lock, in one go, not responding to outside snoops.
My 66000 ships <core> priority around with memory accesses so that the
remote cache can determine if it fails or if requestor fails. Helping
higher priority atomic-events at the cost to lower priority events.
But also; if core reaches the beginning of the stores (to participating
cache lines) while all this data traffic is sorting itself out, then,
core gains the ability to NaK SNOOP requests (priority based) to >participating lines until the event finishes with the SC to the final >variable in memory.
My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).
When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.
This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
needs to understand that the CDS is no longer in the state this core
read out at the beginning, and instead of trying to complete, it should
take a fresh look at CDS. For example several cores try to deQueue
work from the front of a Queue. One core "wins" and deQueues the front
unit of work. The others need to know the Queue element they were going
after is not there are fail to a point where they take a fresh look at
the Queue.
Add a Transaction-prediction structure, to mark which atomics seem to be
the start of a transaction, and which ones are not, and roughly what that
transaction looks like. And then you've solved the cache thrashing of
locks--after seeing an atomic a few times, you learn what it's doing, and
can try to make it succeed each time it executes. And how long each
transaction seems to be, the raise the limit for blocking snoops to
allow it to be more likely to succeed.
How are you going to weave in that some cores have higher priority than >others? You don't want a low priority queue NaKing snoops which would
allow the higher priority core to make forward progress at the expense
of the lower priority cores. As an example, the low priority event is >registering that the left-rear tire is low on air pressure, while the
high priority event is registering panic-braking to avoid a collision.
To handle the load peeking at a lock to see if its free requires
the Transaction-prediction hardware to detect this load and effectively
turn it into a Read-and-get-Write-Ownership, which solves the thrashing,
and it becomes the start of the transaction.
My 66000 can respond to Read-with-Intent with here is data, but it is Shared >not Exclusive (based on priority at location of cache line).
This avoids the need for any software changes, while directly speeding up
atomics and what atomics are doing. If the software algorithm is poor,
where there's a big lock protecting a lot of work, then it exceeds the
hardware's ability to do efficiently, and the hardware instead works like
other platforms, and there can be thrashing.
I agree with most of the functionality of the transaction-predictor,
but it is already built into My 66000 ISA--and is not a predictor,
but mechanics.
Kent
THanks again.
In article <[email protected]>,------------------------
MitchAlsup <[email protected]d> wrote:
My 66000 ISA contains the necessary instruction semantics to perform >core-atomic events over multiple participating cache lines (up to 8).
I'm proposing an idea that can be implemented on any CPU, and requires no special support from the interconnect.
I think your NAK idea sounds risky. I've dealt a lot in the past with
buses using retries, and the short answer is it's VERY HARD to implement
a protocol with retries and not have degenerate starvation cases.
Everyone thinks they've got it solved, that it's easy, and it's just not easy. There's
a reason retries are removed from almost all modern protocols.
The problem is if you have 3 CPUs contending, and the way NAKs work one
CPU might be the unlucky victim forever, with 2 CPUs starving it forever.
In general, it's best to avoid retry-like behavior. For simple cases, you could build in some starvation detection, but in general that just pushes your starvation/livelock case to a more complex case, not fix it.
When it does respond to the snoops, the first snooper immediately wins the >> lock and gets its work done, and so on. For this to work easily, the
software should NOT peek at the lock location with a load, to "see" if
it's free. It should just grab it.
This seems to me to be wrongly optimistic!! Another core attempting a >transaction on a concurrent data structure, and the attempt fails
needs to understand that the CDS is no longer in the state this core
read out at the beginning, and instead of trying to complete, it should >take a fresh look at CDS. For example several cores try to deQueue
work from the front of a Queue. One core "wins" and deQueues the front
unit of work. The others need to know the Queue element they were going >after is not there are fail to a point where they take a fresh look at
the Queue.
I tried to expand my thinking with the log() function example so you can see that each CPU will see the updated state left by the previous one.
For your CPU, you propose a Transaction sequence with a start and end instruction, and I'm just pointing out you can get basically all of the benefit with some prediction hardware picking out mutexes and making them just as efficient.
But with NONE of the deadlock or forward progress--- Synchronet 3.22a-Linux NewsLink 1.2
issues, since if things go "wrong", you just default to running the code
as written.
[email protected] (Kent Dickey) posted:
Thank you Kent, much valuable information below.
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
This is mainly for Chris M. Thomasson but everyone is free to pile on::
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS} >>> fall under this definition.
define: memory-atomic as a unit of work identified in a core and packaged >>> up to be performed by the cache hierarchy or PCIe device control registers. >>> Instructions like the openGL subroutines {ADD, SUB, AND, OR, XOR, SWAP,
CMP-SWAP}-to-memory along with {XADD, ...}. When cacheable, these units- >>> of-work are performed in the cache having write-permission {EXCLUSIVE or >>> MODIFIED cache line states}. When unCacheable, these units of work are
performed directly on the addressed control register. In both cases, any >>> interested party sees only the previous state or the after state.
I understand why some folks like memory-atomic--it solves the
transaction storm issue when a location is heavily contended. The
tradeoff is making the atomic action very slow all the time even when
there's no contention. To simplify discussion, I'm just going to call
the contended atomic location a "lock", even if it isn't really one.
LL/SCs are designed to take no more than LD/ST pipeline time outside of contention.
The Arm term for your memory-atomic is Far-Atomic. If the CPU does an
atomic and it's not in the cache, then it sends the atomic operation
into the fabric to let some other central thing do the atomic operation.
And your core-atomic is just a way to say the atomic is done by a CPU
once it gets the cacheline Exclusive.
What do you do when you have more than 1 cache line--and all lines need
to be in an exclusive state prior to modifying concurrent data ??
I feel the best rule for atomic design (for hardware and software) is to
always optimize the non-contended case, and try to make the
non-contended case as likely as possible. The core-atomic is therefore
what should be planned for and optimized for.
OK.
The problem with memory-atomic/Far-atomic is if the lock is on the same
cacheline as the data its protecting (which is pretty common), then
there is a lot of extra memory traffic pushing this around.
The whole reason for {memory,far}-atomic is to avoid moving data around
the interconnect inorder to perform an event/transaction.
To do the
Far-atomic needs to snoop that data out, do the atomic, return the
atomic result, and then the initiating CPU still needs to fetch the
line. This is all wasted time, especially when there's no contention! The >> usual solution to contention is lots of locks, so the chance that any of
them will be in the cache is low.
Fair point-but Chris M. Thomasson makes the point that the locks should not be placed together--your position is that this happens "more or less" all
the time. Hmmmmm
There are ways to reduce the transaction storm around a contended lock,
but they require hardware to get more involved in the entire
"transaction". To take an example, assume a mutex protects some piece
of data which needs some software manipulation (think the head of a
linked list, and we're going to insert at the head). If written out as
an explicit transaction, where software has to interact with the
platform-specific way to do an atomic transaction (and with all the
restrictions), this is high overhead for software, and does NOT match how
most software is written.
But hardware could dynamically detect transactions when done as
core-atomic. Assume an atomic operation is the start of a
"transaction", for instance where we get a lock, then do something
while owning the lock, and then release the lock. Once we bring in that
cacheline Exclusive and do the atomic operation, a Tentative Transaction
starts. You block responding to snoops until we decide we have to. We
In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become
Exclusive.
This avoids the need for any software changes, while directly speeding up atomics and what atomics are doing.
Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??
On 6/7/2026 11:34 AM, MitchAlsup wrote:
[email protected] (Kent Dickey) posted:
Thank you Kent, much valuable information below.
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>
Given a normal chip with multiple cores and a typical cache hierarchy
with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>> see that the unit of work was either never started or finished completely >>> without any intermediate state being observed by any party other than the >>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
fall under this definition.
DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK CMPXCHG8B on a 32 bit system.
In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become Exclusive.
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded!
strong
does not allow that to occur. So, it can be used for interesting state machines and such.
[...]
The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.--- Synchronet 3.22a-Linux NewsLink 1.2
LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.
On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]
Do you understand how hard it is to perform Double-Compare-Double-Swap without something special in the protocol under withering contention ??
It sounds like you are trying to fulfill a lot of the academic papers?
However, DWCAS is a really important one. This is only a double width compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are non-contiguous, then things get into LL/SC like worlds...
[...]--- Synchronet 3.22a-Linux NewsLink 1.2
On 6/7/2026 11:34 AM, MitchAlsup wrote:-------------------------
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state machines and such.
[...]
The main point is that the software needs to pad and align accordingly. Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an atomic RMW using L2 cache lines.
"Chris M. Thomasson" <[email protected]> posted:
On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]
Do you understand how hard it is to perform Double-Compare-Double-Swap
without something special in the protocol under withering contention ??
It sounds like you are trying to fulfill a lot of the academic papers?
However, DWCAS is a really important one. This is only a double width
compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are
non-contiguous, then things get into LL/SC like worlds...
DDCAS works better as both comparands are 64-bit and back-to-back.
Don't see how you can much to work with two 32-bit comparands.
[...]
"Chris M. Thomasson" <[email protected]> posted:
On 6/7/2026 11:34 AM, MitchAlsup wrote:
[email protected] (Kent Dickey) posted:
Thank you Kent, much valuable information below.
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
This is mainly for Chris M. Thomasson but everyone is free to pile on:: >>>>>
Given a normal chip with multiple cores and a typical cache hierarchy >>>>> with one-or-more PCIe root complexes::
define: core-atomic as a sequence of instructions (possible 1) that
perform a single unit of work atomically--that is all interested parties >>>>> see that the unit of work was either never started or finished completely >>>>> without any intermediate state being observed by any party other than the >>>>> core performing the event. Instructions like {T&S, LL/SC, CAS, DCAS, DCADS}
fall under this definition.
DWCAS, CAS on two adjacent words in the same L2 cache line. Akin to LOCK
CMPXCHG8B on a 32 bit system.
I wrote DCADS to represent the two addresses are completely unrelated
{like *p and *q are unrelated unless you know something special.
You can update 2 pointers in a linked list.
------------------------
In CAS atomics, the LDs which start the lock acquire are not special
in any way; making it hard for core to detect. LL, on the other hand,
tells core that something special is beginning. This reduces contention
because the LD (of CAS) is performed without "for modification" and
thus the lines can be shared. But at CAS, the lines have to become
Exclusive.
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded!
This is what allows the ABA problem to raise its ugly head !! There
could be a week (7 days) between the LD and the CAS--there is no
expectation that the data (modified hundreds of times per second
over that week returns to the exact state at when the LD was performed.)
strong
does not allow that to occur. So, it can be used for interesting state
machines and such.
[...]
The main point is that the software needs to pad and align accordingly.
I have no argument with that. My argument is that one needs FEWER ATOMIC events when more memory locations can be manipulated in a single event.
Fewer Events lower the exponent of interference and improves performance.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an
atomic RMW using L2 cache lines.
"Chris M. Thomasson" <[email protected]> posted:
On 6/7/2026 11:34 AM, MitchAlsup wrote:-------------------------
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state
machines and such.
[...]
The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an
atomic RMW using L2 cache lines.
If you want the strong CAS it is written::
LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
If you want weak CAS it is written:
LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
It just changes the length of time interference is being observed.
In article <1104fui$2ns6j$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On 6/7/2026 8:23 AM, Kent Dickey wrote:
In article <[email protected]>
Sorry for the big snip...
[...]
Fwiw, from my personal experience, two locks need to be padded and
aligned on a LOCK RMW arch, to the L2 cache line; on a LL/SC arch, to
the reservation granule. Get the granule size wrong on LL/SC and your SC
can cause some interesting harm (a lot of sc failures, etc.) to another
core's reservation even when the locks have no logical relationship to
each other...
So if your algo is using LL/SC you need to worry about the reservation
granule.
If its using something akin to LOCK RMW, well, the l2 cache line is your
friend.
Yes, different mutexes/locks/etc. need to be spread apart from each other,
to avoid accidental contention. However, if you have a huge number of mutexes/locks/etc. then it's less important, since by design you're
spreading accesses out over "enough", so you're less likely to get any contention. If you have 300 CPUs and 10,000 database rows with a mutex
per row, then contention is not likely a big deal.
My quick look at C++:
#include <mutex>
std::mutex sample_mutex;
show sample_mutex is 48 bytes. This includes OS/thread overhead to put the thread to sleep if the fast-grab algorithm doesn't work after a few tries, along with some internal alignment padding.
On 6/9/2026 6:28 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/8/2026 5:45 PM, MitchAlsup wrote:
[...]
Do you understand how hard it is to perform Double-Compare-Double-Swap >>>> without something special in the protocol under withering contention ?? >>>It sounds like you are trying to fulfill a lot of the academic papers?
However, DWCAS is a really important one. This is only a double width
compare and swap that operates on two contiguous words. Works great for
a great deal of real lock/wait-free algos. Now if the words are
non-contiguous, then things get into LL/SC like worlds...
DDCAS works better as both comparands are 64-bit and back-to-back.
Don't see how you can much to work with two 32-bit comparands.
[...]
Think of DWCAS on a 32-bit system. LOCK CMPXCHG8B on a 32 bit system. We
can do many interesting things with it. LOCK CMPXCHG16B on a 64-bit
system, ditto.
DWCAS only operates on two contiguous words. Well, that alone allows us
to create a lot of lock/wait free algos. The academic papers were
interested in DCAS, KCSS, ect... They never had to make the cpu that can
do that efficiently.
FWIW, DWCAS can do other things besides ABA. Think of using two 32 bit regions of memory and a DWCAS can store "offsets" into both of them.
On 6/7/2026 8:23 AM, Kent Dickey wrote:
[...]
This avoids the need for any software changes, while directly speeding up
atomics and what atomics are doing.
What about weak vs strong CAS? Also how does emulating a LOCK XADD speed >things up? Now SC can fail, and we have to loop. How many times do we
have to loop to know what that emulated LOCK XADD finishes? Its akin to >emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
former is quite grand.
Breaking things apart works just as well for LL/SC as atomic RMW.
How does the LL/SC work wrt the arch? How fine grain is it? Is its >reservation granule the same size as say a L2 cache line for atomic RMW?
On systems with both LL/SC and atomic RMW, is it one size fits all?
LL/SC can suffer from live lock and false sharing.
LOCK XADD can suffer from false sharing, its not going to live lock on ya.
[...]
In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On 6/7/2026 8:23 AM, Kent Dickey wrote:
[...]
This avoids the need for any software changes, while directly speeding up >>> atomics and what atomics are doing.
What about weak vs strong CAS? Also how does emulating a LOCK XADD speed
things up? Now SC can fail, and we have to loop. How many times do we
have to loop to know what that emulated LOCK XADD finishes? Its akin to
emulating a LOCK XADD with LOCK CMPXCHG. The latter is a mess, the
former is quite grand.
Breaking things apart works just as well for LL/SC as atomic RMW.
How does the LL/SC work wrt the arch? How fine grain is it? Is its
reservation granule the same size as say a L2 cache line for atomic RMW?
On systems with both LL/SC and atomic RMW, is it one size fits all?
LL/SC can suffer from live lock and false sharing.
LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>
[...]
You trimmed everything, so there's really nothing I can respond to. You
have your view as to how atomics should work, and it's an interesting perspective, but not the only one.
If you had access to a system where std::mutex was faster and handled contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
than the best implementation on any system), would you use mutexes rather than lock-free algorithms?
On 6/10/2026 7:58 PM, Kent Dickey wrote:[snip]
In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On systems with both LL/SC and atomic RMW, is it one size fits all?
LL/SC can suffer from live lock and false sharing.
LOCK XADD can suffer from false sharing, its not going to live lock on ya. >>>
[...]
You trimmed everything, so there's really nothing I can respond to. You
have your view as to how atomics should work, and it's an interesting
perspective, but not the only one.
If you had access to a system where std::mutex was faster and handled
contention better than DCAS, CAS, and LOCK XADD in all cases (and faster
than the best implementation on any system), would you use mutexes rather
than lock-free algorithms?
Hummm...
std::mutex is not lock-free by default.
And more importantly, a LOCK XADD is fundamentally different from using
a std::mutex to emulate one. They are not two implementations of the
same thing.
LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.
A std::mutex wrapping an increment is a completely different beast. Even
in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse — >threads block, context switches happen, the OS scheduler gets involved.
So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation — it's what guarantees and costs come with the primitive. LOCK >XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.
There are also contexts where you literally cannot use a mutex — signal >handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice.
ERROR "unexpected byte sequence starting at index 2542: '\xE2'" while decoding:
In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On 6/10/2026 7:58 PM, Kent Dickey wrote:[snip]
In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On systems with both LL/SC and atomic RMW, is it one size fits all?
LL/SC can suffer from live lock and false sharing.
LOCK XADD can suffer from false sharing, its not going to live lock on ya.
[...]
You trimmed everything, so there's really nothing I can respond to. You >> have your view as to how atomics should work, and it's an interesting
perspective, but not the only one.
If you had access to a system where std::mutex was faster and handled
contention better than DCAS, CAS, and LOCK XADD in all cases (and faster >> than the best implementation on any system), would you use mutexes rather >> than lock-free algorithms?
Hummm...
std::mutex is not lock-free by default.
And more importantly, a LOCK XADD is fundamentally different from using
a std::mutex to emulate one. They are not two implementations of the
same thing.
Lock-free algorithms solve two problems of mutexes: they are immune to a thread being reschedules while holding a lock; and they are a response
to poor performance of locks on many systems. Remote-memory atomics
(doing the atomic in the fabric or at the memory controller) are a
response to the poor contention performance of locks or atomics (on lock-free or locks) being done by the CPU on many platforms.
But if we fix those two problems, and make locks fast, then there's much
less need for lock-free algorithms.
Lock-free also solves the thread preemption issue--a thread that wins a mutex and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.
LOCK XADD is a single atomic instruction, one hardware transaction, no >kernel involvement, no thread blocking, no scheduler interaction. Done.
A std::mutex wrapping an increment is a completely different beast. Even >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before >you've even touched the data. Under contention it gets much worse â >threads block, context switches happen, the OS scheduler gets involved.
So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation â it's what guarantees and costs come with the primitive. LOCK
XADD gives you atomicity with no blocking, no kernel, no priority >inversion, no deadlock risk. A mutex gives you none of those properties >regardless of how fast it is.
You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.
I didn't describe this in that post, but if you add in a CPU which supports sequential ordering with the mutex Transaction Prediction logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.
There are also contexts where you literally cannot use a mutex â signal
handlers, certain real-time contexts, interrupt service routines. >Lock-free is not just a performance choice, it's sometimes the only choice.
If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.
A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW implementation.
Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.
What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.
Lock-free algorithms make it much harder for hardware to see what is the
real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new item's next-> to point to that, then doing Compare-and-Swap on the head pointer with our item pointer.
It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the CAS to the head pointer, and can try to make that whole sequence atomic,
but this is just a trivial lock-free case).
Kent--- Synchronet 3.22a-Linux NewsLink 1.2
On 6/9/2026 6:33 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/7/2026 11:34 AM, MitchAlsup wrote:-------------------------
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state
machines and such.
[...]
The main point is that the software needs to pad and align accordingly.
Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS.
LL/SC is working on a reservation granule which can be different than an >> atomic RMW using L2 cache lines.
If you want the strong CAS it is written::
LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
If you want weak CAS it is written:
LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
It just changes the length of time interference is being observed.
Pretty interesting! So, we can have spurious failure on the weak CAS and none on the strong CAS?
strong CAS failure can be used in a state machine, weak CAS failure cannot...--- Synchronet 3.22a-Linux NewsLink 1.2
What's nice about a mutex, which is basically a user-land spinlock
[email protected] (Kent Dickey) posted:[snip]
In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On 6/10/2026 7:58 PM, Kent Dickey wrote:
In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
Lock-free also solves the thread preemption issue--a thread that wins a mutex
and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave >> some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.
The LL/SC sequence on My 66000 implementations is guaranteed to fail
when an interrupt (or exception) causes a transfer of control out of
the thread, not while running the critical section, but while acquiring
or releasing the mutex.
primitive. LOCK
LOCK XADD is a single atomic instruction, one hardware transaction, no
kernel involvement, no thread blocking, no scheduler interaction. Done.
A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse â >> >threads block, context switches happen, the OS scheduler gets involved.
So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation it's what guarantees and costs come with the
XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties
regardless of how fast it is.
You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.
Do you agree that an ISA cannot make an atomic-acquire or atomic-release
any faster than a core can perform the same instruction sequence without
any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).
I didn't describe this in that post, but if you add in a CPU
which supports sequential ordering with the mutex Transaction Prediction
logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.
The LL/SC semantics on My 66000 directly supports using as many as
8 cache lines in a single ATOMIC event with the property that all
3rd party observers only see the before and after states of all
cache lines. From the beginning of the first LL to the end with an
SC, all such participating memory modifications appear atomic.
Thus, there was no need for the concept of nesting nor serialization.
One does not have to get 1 mutex and then get another--you can get both
at the same time (should you need that).
This brings up the notion of the difference between an ATOMIC-event
and a critical section. An ATOMIC-event is a unit of work that all
interested 3rd parties can all agree on "order" {Joe got it Sally
did not}. All of this can be handled by addresses, without consideration
of any data value (address values yes, data values, no). Acquire-mutex
and release-mutex are ATOMIC based on address. While the rest of the
critical section is primarily based on data (and pointers used as data).
There are also contexts where you literally cannot use a mutex â signalIf you have non-interruptable user mutexes, then it can work in these
handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice. >>
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.
It also requires a control transfer point so that if an interrupt is
not able to be deferred, that the mutex guarded critical section has
a way to KNOW that the un-interruptible-event was, indeed, interrupted.
My 66000 has this built into its pipelined version of LL/SC.
A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW
implementation.
I like to say, My 66000 ATOMICs are not a form of TM but can be used to
make a SW version of TM have the appropriate semantics guarantees.
Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free
algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.
What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.
From the HW point of view (and the programming model of My 66000),
There are 3 noted events:
a) acquisition of the mutex(s)
b) transit of the critical section
c) release of the mutex(s).
The problem with old style ATOMICs is that while (a) is easily visible
as ATOMIC, (c) is not (although a predictor would likely work quite
well.
The problem with CAS style ATOMICs is that the initial LD(s) are not
seen as ATOMIC until the CAS is performed, opening a window for ABA.
The problem with LL/SC style ATOMICs is that while the whole sequence
is known to be a critical section, spurious failures are allowed in
most implementations.
It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.
Lock-free algorithms make it much harder for hardware to see what is the
real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new
item's next-> to point to that, then doing Compare-and-Swap on the head
pointer with our item pointer.
Try that on a doubly linked list. That is where DCADS comes in.
It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic,
but this is just a trivial lock-free case).
Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??
ERROR "unexpected byte sequence starting at index 1569: '\xC3'" while decoding:
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
[email protected] (Kent Dickey) posted:[snip]
In article <110f075$1lt51$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
On 6/10/2026 7:58 PM, Kent Dickey wrote:
In article <110a0ml$ao8m$[email protected]>,
Chris M. Thomasson <[email protected]> wrote:
Lock-free also solves the thread preemption issue--a thread that wins a mutex
and then gets interrupted blocks the entire application that relies on
that mutex since that thread is no longer running, so if the platforms gave
some guarantees about that not occurring, then any lock-free algorithm
can pretty easily be changed to use a non-preemptible mutex instead.
The LL/SC sequence on My 66000 implementations is guaranteed to fail
when an interrupt (or exception) causes a transfer of control out of
the thread, not while running the critical section, but while acquiring
or releasing the mutex.
primitive. LOCK
LOCK XADD is a single atomic instruction, one hardware transaction, no >> >kernel involvement, no thread blocking, no scheduler interaction. Done. >> >
A std::mutex wrapping an increment is a completely different beast. Even >> >in the uncontended fast path you're doing an atomic RMW on the lock
word; a memory barrier; _critical section_; then another membar and
atomic RMW on unlock. That's already more work than LOCK XADD before
you've even touched the data. Under contention it gets much worse âÂÂ
threads block, context switches happen, the OS scheduler gets involved. >> >
So the hypothetical of 'what if std::mutex was faster' is interesting
but it's asking the wrong question. The question isn't speed in
isolation it's what guarantees and costs come with the
XADD gives you atomicity with no blocking, no kernel, no priority
inversion, no deadlock risk. A mutex gives you none of those properties >> >regardless of how fast it is.
You may want to re-read my earlier post. I just proposed a way to make
mutex faster than normal processors do even the most basic atomic
sequences.
Do you agree that an ISA cannot make an atomic-acquire or atomic-release >any faster than a core can perform the same instruction sequence without >any of the atomic-stuff applied ??? (LL/SC) becoming (LD/ST).
The obvious answer is no, and that is true for an uncontended atomic instruction. However, once contention comes in to play, anything that reduces the contention storm can be a big win.
It's not clear how to
reduce that storm when it's caused by ordinary LD/ST, but I see lots of
ways to reduce it much of the time if HW can detect the operation's boundaries.
I didn't describe this in that post, but if you add in a CPU >> which supports sequential ordering with the mutex Transaction Prediction >> logic, then that CPU can speculate and execute multiple locks at once
easily (nested or serial), which is something weakly-ordered CPUs just
cannot do since the barriers prevent starting new requests until
ordering is guaranteed.
The LL/SC semantics on My 66000 directly supports using as many as
8 cache lines in a single ATOMIC event with the property that all
3rd party observers only see the before and after states of all
cache lines. From the beginning of the first LL to the end with an
SC, all such participating memory modifications appear atomic.
Thus, there was no need for the concept of nesting nor serialization.
One does not have to get 1 mutex and then get another--you can get both
at the same time (should you need that).
This brings up the notion of the difference between an ATOMIC-event
and a critical section. An ATOMIC-event is a unit of work that all >interested 3rd parties can all agree on "order" {Joe got it Sally
did not}. All of this can be handled by addresses, without consideration
of any data value (address values yes, data values, no). Acquire-mutex
and release-mutex are ATOMIC based on address. While the rest of the >critical section is primarily based on data (and pointers used as data).
Transactional Memory from 15 years ago was going to fundamentally change
the way multi-threaded apps worked. And then it fizzled out. Because
the goal was laudable: remove locks and atomic altogether, and make
adding a record to a database be just:
START_TXN();
add_record_to_database();
END_TXN();
and just do add_record_to_database() effectively atomically. The problem
is that edge cases mean sometimes add_record_to_database() can be 5-10 instructions, and sometimes it's thousands. And Transactional Memory
is very complex for hardware to get right, and it's hard for software to
get right too (since software cannot let there be thousands of instructions inside the TXN, so it needs to be rewritten).
Basically, there are too many unsolved complex problems. Some of which
apply to your ESM--needing software changes is a big problem. And so far
no one has shown how to exhaustively validate this kind of mechanism.
Because there's no real fallback: it has to work, it's not a feature that
can just be "turned off".
There are also contexts where you literally cannot use a mutex â signal
handlers, certain real-time contexts, interrupt service routines.
Lock-free is not just a performance choice, it's sometimes the only choice.
If you have non-interruptable user mutexes, then it can work in these
places. I did not explain how to do this, but is a pretty obvious
add-on, but likely requires a new hint bit on the atomic instructions.
It also requires a control transfer point so that if an interrupt is
not able to be deferred, that the mutex guarded critical section has
a way to KNOW that the un-interruptible-event was, indeed, interrupted.
My 66000 has this built into its pipelined version of LL/SC.
A non-interruptable mutex ends up looking a bit like an explicit
Transaction, but it has different properties and a very different HW
implementation.
I like to say, My 66000 ATOMICs are not a form of TM but can be used to >make a SW version of TM have the appropriate semantics guarantees.
I've had more time to think about my vague description on how to make an non-interruptible mutex, and now I think it's not a good idea. I think trying to hold off actual interrupts is still a good idea, but not to have
HW try to back out of the mutex operation itself.
It's just too much like Transactional Memory--rather than being an optional acceleration which can
be turned off a problem is found, non-interruptible mutex must always be correct and must provide guarantees, and so it's wandered into the minefield of Transactional Memory, and it's best to stay out of there.
Basically, the atomic operation needs a hint that this
is a non-interruptable mutex, so the hardware knows how to release the
lock if an exception occurs (such as a TLB miss). A compare-and-swap
getting the mutex is different than a compare-and-swap doing a lock free >> algorithm, so we need a direct indication from the code that it's a
mutex to get this right. Note that not getting the hint can still
let HW make the mutex fast, it just isn't non-interruptable.
What's nice about a mutex, which is basically a user-land spinlock, is
it's about easiest for software (it's almost always the first described
to new coders since it's easy to understand), AND the easiest for
hardware to accelerate. The critical section is all in one place.
From the HW point of view (and the programming model of My 66000),
There are 3 noted events:
a) acquisition of the mutex(s)
b) transit of the critical section
c) release of the mutex(s).
The problem with old style ATOMICs is that while (a) is easily visible
as ATOMIC, (c) is not (although a predictor would likely work quite
well.
Yes, I agree. But note the release of the mutex is to the same address
as the atomic instruction, so it can be detected after it's been seen.
The problem with CAS style ATOMICs is that the initial LD(s) are not
seen as ATOMIC until the CAS is performed, opening a window for ABA.
Agreed, and it's hard to pick out the preparation work being done for
these lock-free algorithms.
The problem with LL/SC style ATOMICs is that while the whole sequence
is known to be a critical section, spurious failures are allowed in
most implementations.
I am not a fan of LL/SC. It's possible to get it right in HW, but most designers underestimate the invasiveness of LL/SC--the entire load/store
unit needs little modifications everywhere to make LL/SC work efficiently
and reliably. You want to implement transient memory or writes which
bypass the cache, or you speculatively execute instructions past the
end of the LL/SC loop--make sure you solve how they interact with LL/SC.
I've yet to meet any design which got it "right" the first try (but once
a design team learns, then they get it right, until another new system feature pops up and breaks it).
LL/SC is the academic solution to building "let's allow anything" atomics.
"We can let the user build the atomic operations are their dreams, and
we don't have to solidify it in HW, we just let them write code."
It's like a preliminary Transactional Memory--we can let the users do a
lot with us just needing to do a little bit in hardware.
And then when LL/SC gets implemented, you basically end up with Fetch-and-Add, Swap, and Compare-and-Swap (and other trivial ops, like Fetch-and-AND, etc.) only. The RISC-V restrictions are pretty intense:
You can do LR; you can then do a handful of integer instructions (no FP,
no vector) not including any ld/st or system instructions, nor any
backward branches, no function calls; and then SC, then you get the same restrictions but now you get a backwards branch.
What exactly are you
able to implement with these restrictions other than the usual atomic operations that now most everyone puts in HW?
Note on RISC-V, if an implementation has LR/SC, then it must ALSO have the atomic instructions Fetch-and-Add, Swap, Compare-and-Swap, and the other simple ones. So why bother with LR/SC?
I understand where RISC-V's LL/SC restrictions are coming from. I get it. Which is why I don't like LL/SC.
It is possible with My 66000 ATOMICs to perform the whole critical section >WITHIN a single ATOMIC-event.
Lock-free algorithms make it much harder for hardware to see what is the >> real action that needs to occur. Inserting a new head pointer to a
linked list requires looking at the current head pointer, setting my new >> item's next-> to point to that, then doing Compare-and-Swap on the head
pointer with our item pointer.
Try that on a doubly linked list. That is where DCADS comes in.
When I say Compare-and-Swap, I implicitly include DCAS. DCADS is more exotic, so if I mean that, I'd explicitly list it.
It's hard to have hardware detect the
prep work (but not impossible--in this particular case, we can see a
load to the head pointer, then the write to the next-> field, and then the >> CAS to the head pointer, and can try to make that whole sequence atomic, >> but this is just a trivial lock-free case).
Why thinking about this this afternoon, I am left wondering if the page >tables could be augmented with a bit indicating that ATOMIC stuff is going >on on this page ??
I'm just not sure that provides much help.
To me, the key insight which I wanted to get across was that with a mutex (and spinlock), it's a huge advantage to NOT let other CPUs ever see the mutex lock location in the locked state. It solves the contention storm,
and yet it's "optional" meaning if hardware has to "give up", it's still correct (just lower performance).
And that lock-free algorithms would
get the same benefit IF IT CAN BE DETECTED, but that it's much harder
for HW to detect (since the lock-free set of operations is potentially
much larger and more complex).
And that with HW support to solve the contention storm, a mutex (and
a spinlock, which is inside a mutex), which is easy for software to use, could become faster than lock-free algorithms (when spinlocks or mutexes
make sense, since I'm abandoning the idea of making them work in interrupt handlers, etc.).
For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,
so I suggest making masking interrupts an extremely lightweight operation. When you see an
instruction which seems like it would mask interrupts in Fetch/Decode, immediately set a bit to block interrupts. Once the instruction
executes (or is discarded), clear that bit (maybe make it a 4-bit
counter, so you can speculate further). Don't serialize the instruction
that masks off interrupts.
Kent--- Synchronet 3.22a-Linux NewsLink 1.2
"Chris M. Thomasson" <[email protected]> posted:
On 6/9/2026 6:33 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/7/2026 11:34 AM, MitchAlsup wrote:-------------------------
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong
does not allow that to occur. So, it can be used for interesting state >>>> machines and such.
[...]
The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.
If you want the strong CAS it is written::
LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
If you want weak CAS it is written:
LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
It just changes the length of time interference is being observed.
Pretty interesting! So, we can have spurious failure on the weak CAS and
none on the strong CAS?
It is not zero, but it is exponentially fewer; and no architecture is
going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).
strong CAS failure can be used in a state machine, weak CAS failure
cannot...
[email protected] (Kent Dickey) posted:[snip]
For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,
Or transition to a priority level that no interrupt can raise.
so I suggest making masking
interrupts an extremely lightweight operation. When you see an
instruction which seems like it would mask interrupts in Fetch/Decode,
immediately set a bit to block interrupts. Once the instruction
executes (or is discarded), clear that bit (maybe make it a 4-bit
counter, so you can speculate further). Don't serialize the instruction
that masks off interrupts.
Most of what you say above is unnecessary in My 66000 ISRs.
a) the priority of the core is automagically updated with the priority
of the interrupt.
b) the thread.state of the interrupted thread is returned to memory
c) the thread.state of the interrupting thread is fetched from memory
d) the fetch of thread.state can start before the store of thread.state
e) MSI-X interrupt message is already in R0.
So, by the time control arrives, ISR has a complete thread.state, including >the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >near-40-bits from the interrupt so it can figure out what to do based on >control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >instruction). When ISR performs SVR, SVR check pending threads in interrupt >table, and transfers control to any pending thread higher than interrupted >thread (SW does not have to check).
Interrupts do not need to be masked, as all ISRs of lesser priority have >already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled
with no compiler-knowledge of it being an ISR (and no ASM instructions).
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
[snip]
[email protected] (Kent Dickey) posted:
I'm just going to follow up on Linux kernel spinlocks.
For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,
Or transition to a priority level that no interrupt can raise.
The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled. And the kernel has spinlocks it needs to >win. But if the kernel thread gets interrupted with a higher priority task >once it gets the spinlock, but before it finishes with it, this is bad since >some other thread now has to wait a long time for the spinlock to free up.
It also might cause deadlocks.
So the Linux macro is spinlock_irqsave(), which first masks interrupts,
then wins the spinlock (spinning until it wins). Then do the operation >needed, then do spinunlock_irqrestore() to release the spinlock and
allow interrupts again.
The basic idea is this should be fast. We DO NOT want a higher priority >interrupt to occur while we own the spinlock. So we just need to
temporarily mask interrupts. And I'm suggesting this masking and unmasking >of interrupts should be very lightweight and fast.
I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >they've moved all the important stuff to not use spinlock_irqsave() anymore, >I don't know. It still shows up a lot in the Linux source though in various >drivers at least.
So, by the time control arrives, ISR has a complete thread.state, including >>the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and >>near-40-bits from the interrupt so it can figure out what to do based on >>control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single >>instruction). When ISR performs SVR, SVR check pending threads in interrupt >>table, and transfers control to any pending thread higher than interrupted >>thread (SW does not have to check).
Interrupts do not need to be masked, as all ISRs of lesser priority have >>already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled >>with no compiler-knowledge of it being an ISR (and no ASM instructions).
No, during the kernel spinlock_irqsave(), you have to mask out all >interrupts, including higher-priority ones.
The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled.
On 6/11/2026 7:20 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/9/2026 6:33 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 6/7/2026 11:34 AM, MitchAlsup wrote:-------------------------
Notice the weak vs strong CAS in C++. Weak can fail even though the
location is equal to the comparand. It should have succeeded! strong >>>> does not allow that to occur. So, it can be used for interesting state >>>> machines and such.
[...]
The main point is that the software needs to pad and align accordingly. >>>> Using weak CAS in C++ is akin to use LL/SC emulation of a part of CAS. >>>> LL/SC is working on a reservation granule which can be different than an >>>> atomic RMW using L2 cache lines.
If you want the strong CAS it is written::
LDD Rd,[address]
...
...
LL R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
If you want weak CAS it is written:
LL Rd,[address]
...
...
LD R~d[address]
CMP Rt,Rd,R~d
PNE Rt,T
SC Rs,[address]
It just changes the length of time interference is being observed.
Pretty interesting! So, we can have spurious failure on the weak CAS and >> none on the strong CAS?
It is not zero, but it is exponentially fewer; and no architecture is
going to complain of holding back a snoop for 3 instructions (CMP-PNE-SC).
Just to clarify, strong CAS should never spuriously fail. Right?
strong CAS failure can be used in a state machine, weak CAS failure
cannot...
[email protected] (Kent Dickey) writes:
In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:
[snip]
[email protected] (Kent Dickey) posted:
I'm just going to follow up on Linux kernel spinlocks.
For Linux spinlocks, if my memory of this is still right, the Linux
kernel code often needs to mask interrupts,
Or transition to a priority level that no interrupt can raise.
The issue with spinlocks in the Linux kernel is that the kernel a lot of
the time has interrupts enabled. And the kernel has spinlocks it needs to >> win. But if the kernel thread gets interrupted with a higher priority task >> once it gets the spinlock, but before it finishes with it, this is bad since >> some other thread now has to wait a long time for the spinlock to free up. >> It also might cause deadlocks.
So the Linux macro is spinlock_irqsave(), which first masks interrupts,
then wins the spinlock (spinning until it wins). Then do the operation
needed, then do spinunlock_irqrestore() to release the spinlock and
allow interrupts again.
The basic idea is this should be fast. We DO NOT want a higher priority
interrupt to occur while we own the spinlock. So we just need to
temporarily mask interrupts. And I'm suggesting this masking and unmasking >> of interrupts should be very lightweight and fast.
I don't follow Linux closely, so this was a bigger deal 20 years ago. Maybe >> they've moved all the important stuff to not use spinlock_irqsave() anymore, >> I don't know. It still shows up a lot in the Linux source though in various >> drivers at least.
So, by the time control arrives, ISR has a complete thread.state, including >>> the MMU tables, ASID, SP, FP (when needed) from the dormant thread, and
near-40-bits from the interrupt so it can figure out what to do based on >>> control arriving here. ISR can use any instruction including ESM. ISR
can do DPC/softIRQ by ST [MMI/O] targeting an interrupt table (single
instruction). When ISR performs SVR, SVR check pending threads in interrupt >>> table, and transfers control to any pending thread higher than interrupted >>> thread (SW does not have to check).
Interrupts do not need to be masked, as all ISRs of lesser priority have >>> already been blocked, and the thread is already in a re-entrant state--
so, no quirky instruction sequences are necessary. ISRs can be compiled
with no compiler-knowledge of it being an ISR (and no ASM instructions).
No, during the kernel spinlock_irqsave(), you have to mask out all
interrupts, including higher-priority ones.
While true, spinlock_irqsave() will mask interrupts on the current core/hardware thread; Other cores/threads will continue running and
taking interrupts that are either 1-of-N interrupts or specifically
assigned to a core.
In any operating system where nested interrupts on a core are supported, the interrupt handling code must be very carefully coded to ensure that
the handler of a lower priority interrupt does not obtain resources
(e.g. locks) that the higher priority interrupt will require to complete processing the interrupt (classic deadlock). There are other issues
related to kernel thread stack sizes that also pop up with nested
interrupts.
Linux doesn't support nested interrupts on a core, rather it
defers the interrupt hanndling work to a kernel thread and returns immediately from the interrupt, unless handling the interrupt is trivial.
| Sysop: | DaiTengu |
|---|---|
| Location: | Appleton, WI |
| Users: | 1,123 |
| Nodes: | 10 (0 / 10) |
| Uptime: | 32:25:32 |
| Calls: | 14,371 |
| Files: | 186,380 |
| D/L today: |
570 files (158M bytes) |
| Messages: | 2,540,570 |