It does not matter which memory model you are using (RMO, PSO, TSO...); I was always taught to program for it as if it's always in RMO mode. So, if you want an atomic RMW with acquire semantics, we had to:
atomic RMW
MEMBAR #LoadStore | #LoadLoad
For release:
MEMBAR #LoadStore | #StoreStore
atomic RMW
Try to avoid #StoreLoad at all costs, unless absolutely necessary.
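
For readers more used to C++ than SPARC assembly, the barrier placement above can be sketched with `std::atomic_thread_fence`. This is a minimal illustration, not anyone's production lock: `fence_spinlock` and `run_counter` are hypothetical names, the atomic operations are deliberately relaxed ("naked"), and the unlock uses a plain atomic store rather than an RMW. The mapping of the C++ fences onto the MEMBAR masks is approximate.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spinlock: relaxed atomics plus standalone fences.
struct fence_spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        // Naked atomic RMW: no ordering attached to the operation itself.
        while (locked.exchange(true, std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire: barrier AFTER the RMW (roughly #LoadStore | #LoadLoad).
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void unlock() {
        // Release: barrier BEFORE the atomic op (roughly #LoadStore | #StoreStore).
        std::atomic_thread_fence(std::memory_order_release);
        locked.store(false, std::memory_order_relaxed);
    }
};

// Runs `threads` workers that each bump a plain (non-atomic) counter
// `iters` times under the lock, and returns the final count.
long run_counter(int threads, int iters) {
    fence_spinlock lk;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Neither path contains a full (#StoreLoad-style) fence, which is the point of the advice above.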
"Chris M. Thomasson"<[email protected]> posted:
> It does not matter which memory model you are using (RMO, PSO, TSO...); I was
> always taught to program for it as if it's always in RMO mode. So, if you
> want an atomic RMW with acquire semantics, we had to:
> atomic RMW
> MEMBAR #LoadStore | #LoadLoad
Make sure all out-of-order LDs have been made visible system-wide
after the atomic so that Lamport's criterion has been met.
> For release:
> MEMBAR #LoadStore | #StoreStore
> atomic RMW
Same as above but s/LDs/STs/
Note: My 66000 performs these on behalf of the programmer without a
MEMBAR instruction.
> Try to avoid #StoreLoad at all costs, unless absolutely necessary.
On 6/5/2026 6:47 PM, MitchAlsup wrote:
> Note: My 66000 performs these on behalf of the programmer without a
> MEMBAR instruction.
Is your 66000 automatically seq_cst? Or is it TSO?
I ask simply because the explicit MEMBAR combinations I posted for
acquire/release are designed for a fully relaxed model (RMO). On a SPARC running in TSO mode, those specific barriers should safely degrade into hardware no-ops because the arch natively enforces those orderings.
If your 66000 doesn't need them, does it mean your atomic RMWs
automatically carry acquire/release tracking in the pipeline, or is the whole arch just running on a stronger global memory model by default?
"Chris M. Thomasson" <[email protected]> posted:
> Is your 66000 automatically seq_cst? Or is it TSO?
Cacheable is more relaxed than TSO when just accessing memory.
When the first LL is decoded, the LL address cannot leave the
core until all older memory addresses have left the core.
When SC is decoded, it will leave the core before any younger
memory references leave the core. And when there are multiple
participating cache lines, all the intermediate SCs become
visible at the same instant.
So, between a LL and the SC it is sequentially consistent, otherwise
it is causal.
Accesses to MMI/O are SC, accesses to control register headers (such
as BARs) are strongly ordered.
Accesses to ROM are unordered and incoherent.
> Simply because the explicit MEMBAR combinations I posted for
> acquire/release are designed for a fully relaxed model (RMO). On a SPARC
> running in TSO mode, those specific barriers should safely degrade into
> hardware no-ops because the arch natively enforces those orderings.
> If your 66000 doesn't need them, does it mean your atomic RMWs
> automatically carry acquire/release tracking in the pipeline, or is the
> whole arch just running on a stronger global memory model by default?
It knows when it is running an ATOMIC event and switches at the boundaries.
On 6/6/2026 12:23 PM, MitchAlsup wrote:
> It knows when it is running an ATOMIC event and switches at the boundaries.
How would this common publish/consume pattern work on your 66000?
Note: all atomic RMW are naked on the 66000 — no built-in memory order visibility on the atomics themselves.
Pseudo-Code, sorry:
_______________
// say for fun... :^)
// Visible to a specific core cluster
// (e.g. cpuA cores 0-5, cpuB cores 3-7, etc.)
g_p0 = nullptr;
// Thread 1 (producer)
// ... initialize p0 and do work ...
// Publish
release_barrier(); // or equivalent on 66000
atomic_store(&g_p0, p0);
//________________
// Thread 2 (consumer)
l0 = atomic_load(&g_p0);
if (l0)                    // BEQ label0
{
    acquire_barrier(); // or equivalent
    l0->wizzfroboz();
}                          // label0:
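
The pseudo-code above can be made runnable in portable C++ along the following lines. This is only a sketch under stated assumptions: `widget`, `producer`, `consumer`, and `run_publish_consume` are hypothetical names, `wizzfroboz()` is given a trivial body, and the consumer spins until the pointer is published (the original tests it once) so that the demo has a deterministic result. It says nothing about how a 66000 would implement it.

```cpp
#include <atomic>
#include <thread>

struct widget {                     // hypothetical payload type
    int value = 0;
    int wizzfroboz() const { return value; }
};

std::atomic<widget*> g_p0{nullptr};

// Thread 1 (producer): initialize, then publish.
void producer(widget* p0) {
    p0->value = 42;                                       // plain store
    std::atomic_thread_fence(std::memory_order_release);  // release_barrier()
    g_p0.store(p0, std::memory_order_relaxed);            // publish
}

// Thread 2 (consumer): wait for the pointer, then use it.
int consumer() {
    widget* l0 = nullptr;
    while (!(l0 = g_p0.load(std::memory_order_relaxed))) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire_barrier()
    return l0->wizzfroboz();
}

// Runs one producer and one consumer; returns what the consumer observed.
int run_publish_consume() {
    g_p0.store(nullptr, std::memory_order_relaxed);
    widget w;
    int seen = -1;
    std::thread t2([&] { seen = consumer(); });
    std::thread t1([&] { producer(&w); });
    t1.join();
    t2.join();
    return seen;
}
```

The release fence sits before the publishing store and the acquire fence after the load that observes it, matching the barrier placement discussed in this thread.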
"Chris M. Thomasson" <[email protected]> posted:
> How would this common publish/consume pattern work on your 66000?
> Note: all atomic RMW are naked on the 66000 -- no built-in memory order
> visibility on the atomics themselves.
> Pseudo-Code, sorry:
> _______________
> // say for fun... :^)
> // Visible to a specific core cluster
> // (e.g. cpuA cores 0-5, cpuB cores 3-7, etc.)
> g_p0 = nullptr;
When used below, that pointer needs a different value/address.
> // Thread 1 (producer)
> // ... initialize p0 and do work ...
> // Publish
> release_barrier(); // or equivalent on 66000
What functionality is ascribed to release_barrier?
On 6/8/2026 11:59 AM, MitchAlsup wrote:
> What functionality is ascribed to release_barrier?
Fwiw, the exact same as on the SPARC.
MEMBAR #LoadStore | #StoreStore
atomic RMW
The barrier doesn't need a variable identifier because it acts as a
fence on the core's memory execution pipeline, forcing prior writes to
drain before subsequent writes can execute.
Please take careful note that the barrier must be placed before the
atomic logic.
For the acquire, it's:
atomic RMW
MEMBAR #LoadStore | #LoadLoad
Please take careful note that the barrier must be placed after the
atomic logic to prevent speculative reads from leaking backward in the pipeline.
C++ adopted this exact decoupled fence paradigm with std::atomic_thread_fence:
https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence
Take note that no heavy #StoreLoad order has to be used for this classic publish/consume pattern. Just like a mutex. Well, Peterson's aside for a moment...
acquire/release is NOT strong enough to order store followed by a load
to another location.
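
The store-then-load case can be illustrated with the classic store-buffering (Dekker-style) litmus test. The sketch below is an assumption-laden illustration: `both_saw_zero` and `fenced_never_both_zero` are hypothetical names, and the `seq_cst` fence only approximates what `#StoreLoad` provides on SPARC. With release stores and acquire loads alone, C++ permits both threads to read 0. With a full fence between each store and the following load, that outcome is forbidden.

```cpp
#include <atomic>
#include <thread>

// Returns true if BOTH threads read 0, i.e. each load was satisfied
// before the other thread's earlier store became visible.
bool both_saw_zero(bool use_full_fence) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_release);
        if (use_full_fence)
            std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ #StoreLoad
        r1 = y.load(std::memory_order_acquire);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_release);
        if (use_full_fence)
            std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ #StoreLoad
        r2 = x.load(std::memory_order_acquire);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// With the full fence in both threads, "both zero" must never occur.
bool fenced_never_both_zero(int runs) {
    for (int i = 0; i < runs; ++i) {
        if (both_saw_zero(true)) return false;
    }
    return true;
}
```

Calling `both_saw_zero(false)` may or may not ever return true on a given machine; the point is only that the language does not rule it out.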
[snip what I have to ponder on that wrt your arch]
I will get back to you. Busy with some other work.
The SPARC is free form. Now, tagging a variable identifier wrt the
membars might be more efficient, but how does your system work with a
mutex to do that? Say the locked region is comprised of several
unrelated variables?
"Chris M. Thomasson" <[email protected]> posted:
> Fwiw, the exact same as on the SPARC.
> MEMBAR #LoadStore | #StoreStore
> atomic RMW
> The barrier doesn't need a variable identifier because it acts as a
> fence on the core's memory execution pipeline, forcing prior writes to
> drain before subsequent writes can execute.
> Please take careful note that the barrier must be placed before the
> atomic logic.
> For the acquire, it's:
> atomic RMW
> MEMBAR #LoadStore | #LoadLoad
> Please take careful note that the barrier must be placed after the
> atomic logic to prevent speculative reads from leaking backward in the
> pipeline.
In this case both *_barrier are NoOps.
> C++ adopted this exact decoupled fence paradigm with
> std::atomic_thread_fence:
> https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence
> Take note that no heavy #StoreLoad order has to be used for this classic
> publish/consume pattern. Just like a mutex. Well, Peterson's aside for a
> moment...
> acquire/release is NOT strong enough to order store followed by a load
> to another location.
> [snip what I have to ponder on that wrt your arch]
> I will get back to you. Busy with some other work.
> The SPARC is free form. Now, tagging a variable identifier wrt the
> membars might be more efficient, but how does your system work with a
> mutex to do that? Say the locked region is comprised of several
> unrelated variables?
The core runs nominally in causal order.
When an ATOMIC event starts (with a LL) the core reverts to sequentially consistent. All older memory references have to have left the core (L1
and TLB) before the LL can leave the core.
When an ATOMIC event ends (with a SC) the core reverts to causal. All participating lines become visible in the instant the SC is performed,
while no references younger than the event can leave the core before
the SC.
In effect, the core inserts the MEMBARs on behalf of the program at
changes to the ATOMIC-event status.
--------------------------------
Note: the core runs in one of 3 defined modes {Optimistic, Careful,
and Methodological}. At completion of an ATOMIC-event (or context
switch into) the core reverts to Optimistic.
In Optimistic mode, core tries to barrel through the event, and if
nobody saw it pass through, then all is good. However, if anybody
interfered with the passage through the event, the first event fails,
control is transferred to the Atomic-Control-point, and code continues
in careful mode.
{{The Atomic-Control-Point is the address of the first LL instruction
unless a Branch-on-interference is performed which changes the ACP
to the label of the branch.}}
In Careful mode, the core enters the sequentially consistent state and
carefully orders references, inserting a MEMBAR at the LL and another at
the SC. If this fails, the core enters Methodological mode; if it succeeds,
the core reverts to Optimistic.
In methodological mode, core touches the participating inbound memory references, and when it finds the point-of-resolution, it bundles the participating addresses and ships them off to a system arbiter. The
arbiter grants all (or none) and puts the core in a position to NaK interfering requests to its granted lines. A Granted ATOMIC event will succeed. Once finished the core reverts to Optimistic.
{{The arbiter is much like a TLB in size and circuit organization.
The Arbiter processes requests in arrival order, and returns grants
in arrival order. Processes that don't share memory use independent arbiters.}}
{{The point-of-resolution follows SW-instructions touching each
participating line and precedes the first ST to a participating line}}
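
The mode transitions described above can be summarized as a small state machine. This is a software model for illustration only, not the hardware: the enum and `next_mode` are hypothetical names, and the behavior when a Methodological request is not granted is an assumption (the post only says a granted event will succeed).

```cpp
// Software model of the three modes; names are hypothetical.
enum class Mode { Optimistic, Careful, Methodological };

// Given the mode an ATOMIC event was attempted in and whether that
// attempt succeeded, return the mode used next.
constexpr Mode next_mode(Mode current, bool succeeded) {
    switch (current) {
    case Mode::Optimistic:
        // Interference: control goes to the Atomic-Control-Point and
        // the retry runs in Careful mode.
        return succeeded ? Mode::Optimistic : Mode::Careful;
    case Mode::Careful:
        return succeeded ? Mode::Optimistic : Mode::Methodological;
    case Mode::Methodological:
        // A granted event succeeds and the core reverts to Optimistic.
        // ASSUMPTION: a request that is not granted is simply retried.
        return succeeded ? Mode::Optimistic : Mode::Methodological;
    }
    return Mode::Optimistic;  // unreachable for valid enum values
}
```

An uncontended event therefore never leaves Optimistic mode; only repeated interference escalates to the arbiter.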
On 6/9/2026 6:16 PM, MitchAlsup wrote:
> The core runs nominally in causal order.
> When an ATOMIC event starts (with a LL) the core reverts to sequentially
> consistent. All older memory references have to have left the core (L1
> and TLB) before the LL can leave the core.
> When an ATOMIC event ends (with a SC) the core reverts to causal. All
> participating lines become visible in the instant the SC is performed,
> while no references younger than the event can leave the core before
> the SC.
> In effect, the core inserts the MEMBARs on behalf of the program at
> changes to the ATOMIC-event status.
> --------------------------------
> Note: the core runs in one of 3 defined modes {Optimistic, careful,
> and methodological}. At completion of an ATOMIC-event (or context
> switch into) core reverts to optimistic.
> In Optimistic mode, core tries to barrel through the event, and if
> nobody saw it pass through, then all is good. However, if anybody
> interfered with the passage through the event, the first event fails,
> control is transferred to the Atomic-Control-point, and code continues
> in careful mode.
> {{The Atomic-Control-Point is the address of the first LL instruction
> unless a Branch-on-interference is performed which changes the ACP
> to the label of the branch.}}
> In Careful mode, the core enters the sequentially consistent state and
> carefully orders references inserting MEMBAR at the LL and another at
> the SC. If this fails, core enters Methodological mode, if success,
> core reverts to Optimistic.
Sorry for the quick question. Will get back to you on this. It's
interesting. So, you say seq_cst. So, it will automatically handle the
store followed by a load to another location that TSO cannot handle? On
the SPARC that requires a damn #StoreLoad. So, we try to avoid that, no
matter what arch. But if the arch is automatically seq_cst in that area,
then well... How does it compare to a tight algo, say RCU, that can be
used highly efficiently on a weakly ordered system? It does not need
seq_cst, or even acquire/release membars at all. It just needs load-order
dependency.
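
To make the RCU point concrete, here is a minimal sketch of the read side only. `config`, `g_cfg`, `publish`, and `read_sum` are hypothetical names, and reclamation (grace periods) is left out entirely. ISO C++ expresses dependency ordering as `memory_order_consume`, but compilers in practice treat it as acquire, so the sketch uses acquire and notes where a dependency-ordered load would be enough on most weakly ordered hardware (DEC Alpha being the well-known exception).

```cpp
#include <atomic>

struct config { int a; int b; };      // hypothetical shared structure

std::atomic<config*> g_cfg{nullptr};

// Updater: fill in a fresh copy elsewhere, then publish the pointer.
void publish(config* fresh) {
    g_cfg.store(fresh, std::memory_order_release);
}

// Reader: one pointer load, then loads that depend on that pointer.
// The address dependency is what orders c->a and c->b after the
// pointer load on the hardware; acquire stands in for it here.
int read_sum() {
    config* c = g_cfg.load(std::memory_order_acquire);
    return c ? c->a + c->b : -1;      // -1 means nothing published yet
}
```

The reader executes no store and no full fence, which is why this pattern stays cheap on a weak memory model.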