Forum: War Ensemble BBS

Re: Should an ISA contain

From Chris M. Thomasson@[email protected] to comp.arch on Fri May 15 14:13:32 2026

From Newsgroup: comp.arch

On 5/14/2026 10:22 AM, BGB wrote:

On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:

On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.

Also the Motorola 68040.

Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.

One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say “self-modifying code” as soon as I
mentioned this, even though the two are quite different things.

I think it’s quite desirable that an architecture guarantees that an
(to coin a phrase) “instruction view” versus a “data view” of the same
memory location will never show different values.

Sometimes, there is a difference between "nice to have", vs "cost effective".

It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).

The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tends to scale upwards as core counts increase.

Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.

Say, typical scenario with incoherent caches:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache sends a copy to B.
A and B now have incoherent copies.
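
A toy Python model of the incoherent case above, for illustration only. The class names `L2` and `Core` and the address `0x40` are made up for this sketch and do not describe any real design; the point is just that an L2 which grants every request, with no tracking, lets two private copies diverge.

```python
# Toy model: an L2 that hands out private copies with no coherence tracking.
# All names here are illustrative assumptions, not a real design.

class L2:
    def __init__(self):
        self.lines = {}                      # address -> value

    def request(self, addr):
        return self.lines.get(addr, 0)       # always grants a copy, no tracking

    def write_back(self, addr, value):
        self.lines[addr] = value             # last writer wins


class Core:
    def __init__(self, l2):
        self.l2 = l2
        self.l1 = {}                         # private, possibly stale copies

    def store(self, addr, value):
        if addr not in self.l1:              # "Requests Line (for Write)"
            self.l1[addr] = self.l2.request(addr)
        self.l1[addr] = value                # only the local copy changes

    def load(self, addr):
        if addr not in self.l1:
            self.l1[addr] = self.l2.request(addr)
        return self.l1[addr]

    def flush(self, addr):
        self.l2.write_back(addr, self.l1.pop(addr))


l2 = L2()
a, b = Core(l2), Core(l2)
a.store(0x40, 1)
b.store(0x40, 2)
views = (a.load(0x40), b.load(0x40))         # each core sees only its own store
a.flush(0x40)
b.flush(0x40)
final = l2.request(0x40)                     # value from whichever core flushed last
```

Here `views` holds two different values for the same address, and `final` depends only on flush order.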

Versus Say:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache rejects B's Request;
L2 Cache sends a request to A to write line back;
Core A writes line back (flushing it locally);
(Maybe) L2 signals to Core B that the line is now available.
Core B Requests Line again (retry);
L2 Cache sends a copy to B.

In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
Core A requests a line for Volatile Write;
Core B Requests Line (also for Volatile Write);
L2 Cache sends a copy to A;
L2 Cache ignores B's Request (it can cycle the ring some more);
    L2 cache can track volatile lines and see that it is in-use.
Core A writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.
L2 Cache sends a copy to B
    Via the original request cycling around and hitting L2 again
Core B writes back line and flushes local copy;
    L2 cache then marks the volatile access as complete.

Because volatile accesses flush the cached dirty lines immediately, this means that there is a performance penalty, but these accesses can remain coherent (but without the impact of trying to make all memory coherent).
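
A rough Python sketch of the volatile sequence above, under stated assumptions: the ring is modeled as a simple queue, and `VolatileL2`, `in_use`, and `step` are names invented here. It only shows the ordering effect (B is not served until A has written back), not timing or any real bus protocol.

```python
from collections import deque

class VolatileL2:
    """Hypothetical L2 that tracks which lines are held by a volatile access."""
    def __init__(self):
        self.lines = {}          # address -> value
        self.in_use = set()      # lines currently held by a volatile access
        self.ring = deque()      # requests still cycling around the ring
        self.log = []            # order in which cores were actually served

    def request(self, core, addr):
        self.ring.append((core, addr))

    def step(self):
        """Let one request hit L2; serve it, or send it around again."""
        core, addr = self.ring.popleft()
        if addr in self.in_use:
            self.ring.append((core, addr))   # ignored for now, keeps cycling
            return None
        self.in_use.add(addr)
        self.log.append(core)
        return core, self.lines.get(addr, 0)

    def write_back(self, addr, value):
        self.lines[addr] = value
        self.in_use.discard(addr)            # volatile access is complete


l2 = VolatileL2()
l2.request("A", 0x40)
l2.request("B", 0x40)

core, value = l2.step()          # A is served
blocked = l2.step()              # B hits L2 while the line is in use
l2.write_back(0x40, value + 1)   # A updates, writes back, flushes its copy
core2, value2 = l2.step()        # B's original request comes around again
l2.write_back(0x40, value2 + 1)  # B sees A's update and builds on it
```

Both increments survive because B never holds a copy at the same time as A.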

For something like an inter-processor JIT, this would alas still require flushing the L1 caches in a way that is coordinated between threads.

Normally, the mutex mechanism does not include I$ flushes, though one possibility could be to have, say, a separate JIT mutex lock, where if threads (upon trying to lock a mutex) see a JIT Sequence Number that
does not match the expected value for that mutex on that processor core,
it also triggers an I$ flush.

Say:
JIT Lock:
    Flush Caches;
    Lock Mutex;
    Increment JIT Sequence Number (JSN).
Do stuff;
Flush Caches;
Unlock Mutex;
    Flush Caches;
    Set mutex to unlocked.

Lock Mutex (Normal):
    Flush Caches;

Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire,
and #LoadStore | #StoreStore for release. No #StoreLoad ordering.

    Lock Mutex;
    Check JSN against core's current JSN;
      If mismatch, flush I$ and update core's JSN.
      Likely all via CPUID and a lookup table, not new arch.
Do Stuff;
Unlock Mutex:
    ...
...
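
The JSN idea in the pseudocode above could be sketched like this in Python. This is a single-threaded illustration, not an implementation: `JitDomain`, `Core`, `seen_jsn`, and `flush_icache` are hypothetical names, the data-cache flushes are left out, and having the writer flush its own I$ on unlock is an assumption added here.

```python
import threading

class JitDomain:
    """Shared state: one mutex plus a global JIT Sequence Number (JSN)."""
    def __init__(self):
        self.mutex = threading.Lock()
        self.jsn = 0


class Core:
    def __init__(self, name):
        self.name = name
        self.seen_jsn = 0         # last JSN this core synchronized with
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1  # stand-in for a real I$ invalidate

    def jit_lock(self, dom):
        dom.mutex.acquire()
        dom.jsn += 1              # announce that code is about to change

    def jit_unlock(self, dom):
        self.flush_icache()       # assumption: writer drops its own stale I$ lines
        self.seen_jsn = dom.jsn
        dom.mutex.release()

    def lock(self, dom):
        dom.mutex.acquire()
        if self.seen_jsn != dom.jsn:   # code changed since this core last looked
            self.flush_icache()
            self.seen_jsn = dom.jsn

    def unlock(self, dom):
        dom.mutex.release()


dom = JitDomain()
writer, reader = Core("A"), Core("B")

reader.lock(dom)          # no JIT activity yet: no flush
reader.unlock(dom)
writer.jit_lock(dom)      # JIT writer bumps the JSN
writer.jit_unlock(dom)
reader.lock(dom)          # JSN mismatch: one I$ flush
reader.unlock(dom)
reader.lock(dom)          # already up to date: no flush
reader.unlock(dom)
```

The reader pays for an I$ flush only on the first lock after JIT activity, which is the cost the scheme is trying to limit.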

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@[email protected] to comp.arch on Fri May 15 13:11:02 2026

Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.

Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?

Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.

=== Stefan

From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 16 05:57:47 2026

Stefan Monnier <[email protected]> writes:

Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.

Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?

Of course it can, although I would not call it "branch" recovery.

The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of
thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.

Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.

Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.
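
The arithmetic behind those numbers can be reproduced roughly as follows. The 90ns latency and the 448-entry ROB come from the text above; the 5.5 GHz clock and the 8 slots per cycle are assumptions chosen here to match "about 500 cycles" and "4000 potential instruction slots".

```python
latency_ns = 90          # inter-core latency figure used in the post
clock_ghz = 5.5          # assumption: roughly "the higher Zen5 clock rates"
slots_per_cycle = 8      # assumption: width implied by 500 cycles -> 4000 slots
rob_entries = 448        # Zen5 ROB size quoted in the post

cycles = latency_ns * clock_ghz              # ns * cycles/ns
slots = cycles * slots_per_cycle             # instruction slots in flight
ratio = slots / rob_entries                  # how far beyond the ROB that is

print(f"{cycles:.0f} cycles, {slots:.0f} slots, {ratio:.1f}x the ROB")
```

With these inputs the window needed to cover one communication latency comes out at several times the ROB size, which is the gap the snapshot-and-recovery idea is meant to bridge.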

Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.

Concerning performance costs, whenever a conflict is detected, one way
of recovery would be to reset all cores to the architectural state of
the last snapshot before the conflict happened. One can probably find
less draconian ways to ensure consistency, but I consider them to be
optimizations. One optimization might be to predict the conflict and
hold back the corresponding load such that no conflict happens and no
reset is necessary. Another might be to find out which cores
communicate, and only reset those that have talked to each other since
the snapshot.
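
As a sketch of that last optimization, one could compute the set of cores to reset as everything transitively connected to the conflicting core by communication since the snapshot. The function name `cores_to_reset` and the message list are hypothetical, and treating communication as undirected is a conservative assumption made here, not something stated above.

```python
from collections import defaultdict, deque

def cores_to_reset(conflicting_core, messages):
    """Return every core transitively linked to the conflicting core
    by communication since the last snapshot."""
    talked = defaultdict(set)
    for src, dst in messages:          # treat communication as undirected
        talked[src].add(dst)
        talked[dst].add(src)

    reset = {conflicting_core}
    todo = deque([conflicting_core])
    while todo:                        # breadth-first walk over the links
        core = todo.popleft()
        for peer in talked[core] - reset:
            reset.add(peer)
            todo.append(peer)
    return reset


# Hypothetical epoch: cores 0-1-2 exchanged lines, 3 and 4 talked only to each other.
messages = [(0, 1), (1, 2), (3, 4)]
affected = cores_to_reset(1, messages)
```

In this example cores 3 and 4 keep running, since nothing they did since the snapshot could depend on the conflict.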

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From MitchAlsup@[email protected] to comp.arch on Sat May 16 18:04:55 2026

[email protected] (Anton Ertl) posted:

Stefan Monnier <[email protected]> writes:

Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.

Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?

Of course it can, although I would not call it "branch" recovery.

The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.

Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.

Consider the case where the speculative LD is interfering with another
CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
SC off as a failure. How does one recover that ??

So, yes, you can recover this CPU's state, but no, you cannot precisely
recover the other CPU's state.

Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.

And that only recovers the state, not the intent of the state (above).

Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.

Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually
Consistent machine.

Concerning performance costs, whenever a conflict is detected, one way
of recovery would be to reset all cores to the architectural state of
the last snapshot before the conflict happened.

Just broadcasting that it needs to be recovered on a multi-chip
multi-processor is going to take on the order of 1000 instructions (200 ns).

One can probably find
less draconian ways to ensure consistency, but I consider them to be
optimizations. One optimization might be to predict the conflict and
hold back the corresponding load such that no conflict happens and no
reset is necessary.

Hardwiring ST-to-LD 'pin' ordering is, in effect, Sequential Consistency.

Another might be to find out which cores
communicate, and only reset those that have talked to each other since
the snapshot.

Given 256 cores across 4-chips, this represents 256,000 instructions
of recovery buffering. ... So, I doubt this is practicable even if
feasible.

- anton

From Terje Mathisen@[email protected] to comp.arch on Sat May 16 20:50:49 2026

MitchAlsup wrote:

[email protected] (Anton Ertl) posted:

Stefan Monnier <[email protected]> writes:

Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.

Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?

Of course it can, although I would not call it "branch" recovery.

The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of
thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.

Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.

Consider the case where the speculative LD is interfering with another
CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
SC off as a failure. How does one recover that ??

So, yes, you can recover this CPU's state, but no, you cannot precisely
recover the other CPU's state.

Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.

And that only recovers the state, not the intent of the state (above).

Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.

Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually
Consistent machine.

Casually -> Causally ?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup@[email protected] to comp.arch on Sat May 16 23:01:00 2026

Terje Mathisen <[email protected]> posted:

MitchAlsup wrote:

[email protected] (Anton Ertl) posted:

Stefan Monnier <[email protected]> writes:

Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.

Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?

Of course it can, although I would not call it "branch" recovery.

The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of
thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.

Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.

Consider the case where the speculative LD is interfering with another
CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the
SC off as a failure. How does one recover that ??

So, yes, you can recover this CPU's state, but no, you cannot precisely
recover the other CPU's state.

Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.

And that only recovers the state, not the intent of the state (above).

Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.

Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually Consistent machine.

Casually -> Causally ?

Friggen spelling corrector.....

Terje

From BGB@[email protected] to comp.arch on Sun May 17 12:16:04 2026

On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:

On 5/14/2026 10:22 AM, BGB wrote:

On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:

On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.

Also the Motorola 68040.

Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.

One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say “self-modifying code” as soon as I
mentioned this, even though the two are quite different things.

I think it’s quite desirable that an architecture guarantees that an
(to coin a phrase) “instruction view” versus a “data view” of the same
memory location will never show different values.

Sometimes, there is a difference between "nice to have", vs "cost
effective".

It is nicer, say, to have I$/D$ coherence, and to not require explicit
flushing and invalidation. Or, say, to have caches that are implicitly
coherent between threads (Core A stores to a location, Core B loads
from that location, Core B sees what Core A stored).

The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tends to scale upwards as core counts increase.

Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is
not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer
bandwidth on the bus.

Say, typical scenario with incoherent caches:
   Core A Requests Line (for Write);
   Core B Requests Line (also for Write);
   L2 Cache sends a copy to A;
   L2 Cache sends a copy to B.
A and B now have incoherent copies.

Versus Say:
   Core A Requests Line (for Write);
   Core B Requests Line (also for Write);
   L2 Cache sends a copy to A;
   L2 Cache rejects B's Request;
   L2 Cache sends a request to A to write line back;
   Core A writes line back (flushing it locally);
   (Maybe) L2 signals to Core B that the line is now available.
   Core B Requests Line again (retry);
   L2 Cache sends a copy to B.

In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
   Core A requests a line for Volatile Write;
   Core B Requests Line (also for Volatile Write);
   L2 Cache sends a copy to A;
   L2 Cache ignores B's Request (it can cycle the ring some more);
     L2 cache can track volatile lines and see that it is in-use.
   Core A writes back line and flushes local copy;
     L2 cache then marks the volatile access as complete.
   L2 Cache sends a copy to B
     Via the original request cycling around and hitting L2 again
   Core B writes back line and flushes local copy;
     L2 cache then marks the volatile access as complete.

Because volatile accesses flush the cached dirty lines immediately,
this means that there is a performance penalty, but these accesses can
remain coherent (but without the impact of trying to make all memory
coherent).

For something like an inter-processor JIT, this would alas still
require flushing the L1 caches in a way that is coordinated between
threads.

Normally, the mutex mechanism does not include I$ flushes, though one
possibility could be to have, say, a separate JIT mutex lock, where if
threads (upon trying to lock a mutex) see a JIT Sequence Number that
does not match the expected value for that mutex on that processor
core, it also triggers an I$ flush.

Say:
   JIT Lock:
     Flush Caches;
     Lock Mutex;
     Increment JIT Sequence Number (JSN).
   Do stuff;
   Flush Caches;
   Unlock Mutex;
     Flush Caches;
     Set mutex to unlocked.

   Lock Mutex (Normal):
     Flush Caches;

Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire,
and #LoadStore | #StoreStore for release. No #StoreLoad ordering.

Cache Flushing on Mutex Lock:
Anything that was in-memory is now written back;
Cache is ready to accept new (non-stale) data.

Cache Flush on Mutex Unlock:
Anything dirty in cache during time mutex was held is now written back;
...

This causes mutex lock/unlock to become a sort of memory ordering event.

It is sort of needed for a weak model to work for multi-core
multi-threading and not just end up exploding (and some practices will
still not work as they would on a core with stronger memory ordering and
cache coherence).

Can skip the flushing though in cases where a mutex is only being
used from a single core (since memory is coherent within a core).

     Lock Mutex;
     Check JSN against core's current JSN;
       If mismatch, flush I$ and update core's JSN.
       Likely all via CPUID and a lookup table, not new arch.
   Do Stuff;
   Unlock Mutex:
     ...
   ...
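
To make the flush-on-lock and flush-on-unlock behaviour described above concrete, here is a small Python model. It is a sketch under assumptions: `Memory`, `Core`, and `flush` are invented names, the mutex itself is omitted (only the cache side effects of lock and unlock are modeled), and one flush does both the write-back and the invalidate.

```python
class Memory:
    def __init__(self):
        self.data = {}                   # address -> value


class Core:
    def __init__(self, mem):
        self.mem = mem
        self.l1 = {}                     # private cache, not kept coherent
        self.dirty = set()               # addresses modified locally

    def load(self, addr):
        if addr not in self.l1:
            self.l1[addr] = self.mem.data.get(addr, 0)
        return self.l1[addr]

    def store(self, addr, value):
        self.l1[addr] = value
        self.dirty.add(addr)

    def flush(self):
        for addr in self.dirty:          # write back anything dirty
            self.mem.data[addr] = self.l1[addr]
        self.dirty.clear()
        self.l1.clear()                  # drop possibly stale lines

    def lock(self):                      # mutex acquire itself is omitted
        self.flush()

    def unlock(self):                    # mutex release itself is omitted
        self.flush()


mem = Memory()
a, b = Core(mem), Core(mem)

b.load(0x10)                 # B caches the old value
a.lock()
a.store(0x10, 42)
a.unlock()                   # A's dirty line reaches memory

stale = b.load(0x10)         # B skipped the lock: still sees its old copy
b.lock()
fresh = b.load(0x10)         # after the lock-time flush B refetches
b.unlock()
```

The stale read is the kind of practice that works on a coherent machine but not in this model unless the reader also goes through the lock.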

From MitchAlsup@[email protected] to comp.arch on Sun May 17 22:08:08 2026

BGB <[email protected]> posted:

On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:

On 5/14/2026 10:22 AM, BGB wrote:

On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:

On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

-------------------Why can't BGB clip unnecessary lines in the thread ???

Cache Flushing on Mutex Lock:
Anything that was in-memory is now written back;
Cache is ready to accept new (non-stale) data.

Cache Flush on Mutex Unlock:
Anything dirty in cache during time mutex was held is now written back;
...

This causes mutex lock/unlock to become a sort of memory ordering event.

This is the means by which My 66000 presents all participating cache
lines {as before or as after} in a single instant to the rest of the
system. Here, the trigger is the STL (equivalent to SC) and a check for
no deleterious interference.

It is sort of needed for a weak model to work for multi-core
multi-threading and not just end up exploding (and some practices will
still not work as they would on a core with stronger memory ordering and cache coherence).

At the start of an ATOMIC event My 66000 reverts to sequential consistency.
At the termination My 66000 reverts to causal consistency.

From Chris M. Thomasson@[email protected] to comp.arch on Mon May 18 17:14:52 2026

On 5/17/2026 10:16 AM, BGB wrote:

On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:

On 5/14/2026 10:22 AM, BGB wrote:

On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:

On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.

Also the Motorola 68040.

Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.

One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say “self-modifying code” as soon as I
mentioned this, even though the two are quite different things.

I think it’s quite desirable that an architecture guarantees that an
(to coin a phrase) “instruction view” versus a “data view” of the same
memory location will never show different values.

Sometimes, there is a difference between "nice to have", vs "cost
effective".

It is nicer, say, to have I$/D$ coherence, and to not require
explicit flushing and invalidation. Or, say, to have caches that are
implicitly coherent between threads (Core A stores to a location,
Core B loads from that location, Core B sees what Core A stored).

The requirements to pull all this off in practice may add significant
costs; and also in ways where the performance cost of the coherence
mechanisms tends to scale upwards as core counts increase.

Say, for example, if one has coherent caches, software that depends
on the cache-coherent behavior, and much more than 2 or 4 cores, it
is not difficult to imagine scenarios where waiting on cache-
coherence mechanisms becomes a more significant cost than actual
memory-transfer bandwidth on the bus.

Say, typical scenario with incoherent caches:
   Core A Requests Line (for Write);
   Core B Requests Line (also for Write);
   L2 Cache sends a copy to A;
   L2 Cache sends a copy to B.
A and B now have incoherent copies.

Versus Say:
   Core A Requests Line (for Write);
   Core B Requests Line (also for Write);
   L2 Cache sends a copy to A;
   L2 Cache rejects B's Request;
   L2 Cache sends a request to A to write line back;
   Core A writes line back (flushing it locally);
   (Maybe) L2 signals to Core B that the line is now available.
   Core B Requests Line again (retry);
   L2 Cache sends a copy to B.

In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
   Core A requests a line for Volatile Write;
   Core B Requests Line (also for Volatile Write);
   L2 Cache sends a copy to A;
   L2 Cache ignores B's Request (it can cycle the ring some more);
     L2 cache can track volatile lines and see that it is in-use.
   Core A writes back line and flushes local copy;
     L2 cache then marks the volatile access as complete.
   L2 Cache sends a copy to B
     Via the original request cycling around and hitting L2 again
   Core B writes back line and flushes local copy;
     L2 cache then marks the volatile access as complete.

Because volatile accesses flush the cached dirty lines immediately,
this means that there is a performance penalty, but these accesses
can remain coherent (but without the impact of trying to make all
memory coherent).

For something like an inter-processor JIT, this would alas still
require flushing the L1 caches in a way that is coordinated between
threads.

Normally, the mutex mechanism does not include I$ flushes, though one
possibility could be to have, say, a separate JIT mutex lock, where
if threads (upon trying to lock a mutex) see a JIT Sequence Number
that does not match the expected value for that mutex on that
processor core, it also triggers an I$ flush.

Say:
   JIT Lock:
     Flush Caches;
     Lock Mutex;
     Increment JIT Sequence Number (JSN).
   Do stuff;
   Flush Caches;
   Unlock Mutex;
     Flush Caches;
     Set mutex to unlocked.

   Lock Mutex (Normal):
     Flush Caches;

Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire,
and #LoadStore | #StoreStore for release. No #StoreLoad ordering.

Cache Flushing on Mutex Lock:
Anything that was in-memory is now written back;
Cache is ready to accept new (non-stale) data.

Cache Flush on Mutex Unlock:
Anything dirty in cache during time mutex was held is now written back;
...

This causes mutex lock/unlock to become a sort of memory ordering event.

A mutex lock/unlock does not need #StoreLoad ordering. So, keep that in
mind.

It is sort of needed for a weak model to work for multi-core
multi-threading and not just end up exploding (and some practices will
still not work as they would on a core with stronger memory ordering and
cache coherence).

Can skip the flushing though in cases where a mutex is only being used
from a single core (since memory is coherent within a core).

     Lock Mutex;
     Check JSN against core's current JSN;
       If mismatch, flush I$ and update core's JSN.
       Likely all via CPUID and a lookup table, not new arch.
   Do Stuff;
   Unlock Mutex:
     ...
   ...
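
The JSN check in the pseudocode above could be sketched roughly as below. This is a toy Python model under stated assumptions: ToyCore, JitMutex and the flush counter are invented names, the "I$ flush" is only a counter, and each core remembers a single JSN (the post describes an expected value per mutex per core). What the JIT core does about its own I$ is left out, since the post does not say.

```python
# Toy model of a JIT mutex with a JIT Sequence Number (JSN).
# Names are hypothetical; nothing here touches real caches.

class ToyCore:
    def __init__(self):
        self.seen_jsn = 0         # last JSN this core acted on
        self.icache_flushes = 0   # stands in for a real I$ flush


class JitMutex:
    def __init__(self):
        self.jsn = 0              # bumped by every JIT lock
        self.locked = False

    def jit_lock(self, core):
        # Taken by the thread that is about to write new code.
        assert not self.locked
        self.locked = True
        self.jsn += 1

    def lock(self, core):
        # Normal lock: compare the mutex JSN with what this core last saw.
        assert not self.locked
        self.locked = True
        if core.seen_jsn != self.jsn:
            core.icache_flushes += 1      # mismatch: flush I$
            core.seen_jsn = self.jsn      # and remember the new JSN

    def unlock(self):
        self.locked = False
```

In this model a core pays for one I$ flush on the first lock after each JIT update and nothing on later locks until the JSN changes again.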

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@[email protected] to comp.arch on Tue May 19 20:20:40 2026

From Newsgroup: comp.arch

On Fri, 08 May 2026 23:34:21 +0000, MitchAlsup wrote:

Should an ISA contain an instruction that gives Write-Permission from
the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}

Should an ISA contain an instruction that invalidates (without writing
back) a Data Cache (or L2) line ?? {Discard}

I am not sure how to answer such a question.

After all, programs are written in advance of their being given to a
computer to execute. So how do you even reference a cache line?

The best you can do is point the instruction at a memory address, and
say... _if_ the data at this address is cached, then either invalidate the copy in the cache, or allow the copy in the cache to be updated when this
data is altered.

Such instructions can exist. Do they belong in an ISA? Should they be privileged, since they address the system, or, since they're used to
optimize code, are they hint instructions that ordinary programs need to
have?

Here, then, is where the answer to your question is found. Should an ISA
have these instructions? Yes, _if_ the target machine is such that it
needs this kind of hinting to help it gain the performance it is capable
of producing.

But this is such an obvious answer that you didn't need to ask if that was
all that you could get; I can't give more.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue May 19 21:04:40 2026

From Newsgroup: comp.arch

quadi <[email protected]d> posted:

On Fri, 08 May 2026 23:34:21 +0000, MitchAlsup wrote:

Should an ISA contain an instruction that gives Write-Permission from
the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}

Should an ISA contain an instruction that invalidates (without writing back) a Data Cache (or L2) line ?? {Discard}

I am not sure how to answer such a question.

After all, programs are written in advance of their being given to a computer to execute. So how do you even reference a cache line?

The best you can do is point the instruction at a memory address, and
say... _if_ the data at this address is cached, then either invalidate the copy in the cache, or allow the copy in the cache to be updated when this data is altered.

Such instructions can exist. Do they belong in an ISA? Should they be privileged, since they address the system, or, since they're used to optimize code, are they hint instructions that ordinary programs need to have?

Privileged--likely not especially when there are 4 (or more) layers of privilege.

Should HyperVisor use them in management over SuperVisor?
Should HyperVisor use them in management over user?
Should SuperVisor use them in management over User ?
Should user use them to optimize user stuff ?

Here, then, is where the answer to your question is found. Should an ISA have these instructions? Yes, _if_ the target machine is such that it
needs this kind of hinting to help it gain the performance it is capable
of producing.

But the gain of prefetching {with a dramatically easier concept}
has not demonstrated adding much performance at all.

But this is such an obvious answer that you didn't need to ask if that was all that you could get; I can't give more.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From kegs@[email protected] (Kent Dickey) to comp.arch on Wed May 20 18:06:33 2026

From Newsgroup: comp.arch

In article <[email protected]>,
MitchAlsup <[email protected]d> wrote:

Should an ISA contain an instruction that gives Write-Permission
from the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}

This is what RISC-V/ARM/etc. calls a Clean operation--writeback the data if modified, but keep the line shared (maybe even exclusive).

The main use for this is on systems with non-coherent instruction fetches
or non-coherent DMA--it allows forcing data to the next level cache (or DDR) without having to remove it from the dcache, for modifying instructions or
for updating non-coherent DMA buffers.

You should not need it, but it shouldn't cost much to have. Although I'll
note most architectures I've used have had unique bugs with Clean operations; they often interact in unexpected ways with other coherency traffic.
Nothing hard to find or hard to fix, but it's a source of some bugs.

Should an ISA contain an instruction that invalidates (without
writing back) a Data Cache (or L2) line ?? {Discard}

This is a Bad Idea(TM). You can not let user-level code do this. I am not
aware of any architecture that allows user-level code to do
Data-Cache-Invalidate. Some have tried, all have had to retract it and at
least make it a privileged operation.

The security issue that's hard to fix is around OS calls. The OS can
create new data for the user, and then the user can Invalidate it and
peek behind, and see what secrets lie there. This includes things like
mmap() and allocating new pages through any means, as well as read(),
recv(), stat(), etc. This is such a leaky area you'll never plug all
of the holes. You want the OS to not micromanage the cache, so you need
to not allow the user to peek behind the cache. Yes, the user has write
permissions, but the operation is not just a write, it's a
go-around-the-cache-and-see-what's-in-memory, and that's quite dangerous.

Saying it's "privileged only" just means the attack surface changes to the
OS being able to attack the hypervisor (and so on). Historically, that was often considered acceptable, but I think the OS is less trusted now than 10 years ago.

Note: if your architecture effectively does Invalidates on stack frames that are no longer valid, that might be fine, but it's still an attack surface.
If user code can map the stack on top of OS-provided data, and then increment the stack pointer, and you then invalidate it, you've just given the
user the same Invalidate operation and it can be used to attack the OS.

I would think leaving the stack lines would be best--most code will make more calls, and then you don't have to refetch ownership.

As for terminology, RISC-V and Arm basically agree: Invalidate means the
cache entry is invalid at the end; Clean means any modified data is written back. So Clean-and-Invalidate is a common cache flush (modified data written back, entry ends up being invalid).
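
To make the terminology concrete, here is a toy Python model of a single cache line in front of one memory word. The class and method names are invented for the example and do not correspond to any ISA's instructions. It also shows the peek-behind problem described earlier: if newly written data is still dirty in the cache, an Invalidate without write-back exposes whatever was in memory before.

```python
# Toy model of Clean / Invalidate / Clean-and-Invalidate on one line.
# Names are hypothetical; "memory" is just a dict of addr -> value.

class Line:
    def __init__(self, memory, addr):
        self.memory, self.addr = memory, addr
        self.valid = False
        self.dirty = False
        self.data = None

    def write(self, value):
        # Store hits the cache only; memory is updated later.
        self.data, self.valid, self.dirty = value, True, True

    def read(self):
        if not self.valid:                    # miss: refill from memory
            self.data = self.memory[self.addr]
            self.valid, self.dirty = True, False
        return self.data

    def clean(self):
        # Write back modified data, keep the line resident.
        if self.valid and self.dirty:
            self.memory[self.addr] = self.data
            self.dirty = False

    def invalidate(self):
        # Drop the line; any modified data is lost.
        self.valid = self.dirty = False

    def clean_and_invalidate(self):
        self.clean()
        self.invalidate()
```

For example, with memory = {0: "old secret"}, a write(0) followed by invalidate() makes the next read() return "old secret" again, whereas write(0) followed by clean() leaves 0 both in memory and in the still-valid line.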

You didn't ask, but Prefetch instructions are rarely useful. I know
everyone has them, but what works is prefetch prediction hardware.
Prefetch instructions can also have unique bugs since they don't act
like regular LD/ST.

Kent
--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Wed May 20 20:27:26 2026

From Newsgroup: comp.arch

On 5/14/26 10:57 AM, Scott Lurndal wrote:

Paul Clayton <[email protected]> writes:

On 5/11/26 3:29 AM, Anton Ertl wrote:

A better approach is to do just the writes. I think that zeroing the
page on demand is a good approach, because then it is already in the
D-cache, but AFAIK Linux actually zeros physical pages ahead of time
typically on a separate (otherwise idle) core, and just maps one of
those pages to the virtual page that needs to be written to. I wonder
why Linux does that.

Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one
transaction.

This is more flexible than having cache line and page clearing
instructions.

In what way is it more flexible? It is a page-clearing instruction.

My 66000's memory set instruction is not limited to a page size
defined when the instruction was generated. IBM's Data Cache
Block Zero instruction had a compatibility problem when software
written for early PowerPC caches was to be run on POWER (G5)
with 128-byte cache blocks.

ARM has a system register that software can access to determine
the cache line size for the DC ZVA instruction.

If one architecturally defines cache block size and page size,
one is stuck working around that if a different size is better.

Or, provide a mechanism for the software that performs
the zeroing to determine both the cache and page sizes
dynamically.

That is an alternative, but it requires checking that
information and acting accordingly. If software only wants to
zero an aligned 32-byte section, knowing that the DCBZ
instruction for the implementation on which the software is
running has a 128-byte cache block is not very useful (other
than avoiding wrong behavior); a sequence of zero stores would
have to be used.
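
A small Python sketch of that point, with invented names and a bytearray standing in for memory: given a block size discovered at run time, a block-zero operation can only be used on whole, aligned blocks inside the region, and everything else falls back to ordinary stores. The store count here is in bytes purely to keep the example short; real code would use wider stores.

```python
# Toy model: zero a region using "block zero" where possible.
# zero_region and its parameters are hypothetical, for illustration only.

def zero_region(mem, start, length, block):
    """Zero mem[start:start+length]; return (block_zero_ops, byte_stores)."""
    end = start + length
    first = (start + block - 1) // block * block   # first boundary >= start
    last = end // block * block                    # last boundary <= end
    if first > last:
        # The region sits inside one block: no block-zero op is usable.
        first = last = end
    block_ops = stores = 0
    for addr in range(start, first):               # unaligned head
        mem[addr] = 0
        stores += 1
    for addr in range(first, last, block):         # whole aligned blocks
        mem[addr:addr + block] = bytes(block)
        block_ops += 1
    for addr in range(last, end):                  # unaligned tail
        mem[addr] = 0
        stores += 1
    return block_ops, stores
```

With a 128-byte block, zeroing an aligned 32-byte section at offset 32 uses no block-zero operations and 32 byte stores; with a 32-byte block the same request is one block-zero operation.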

If the zeroing operations are limited to a single function or a
single library — and the sizes are the same across the system —
then patching all the function calls to an alternative function
or loading a different library may be practical.

Software developers seem often enough to assume such values are
universal constants, so "working" software can become broken
software. (This may also be related to the saying "There is
nothing more permanent than a temporary fix.")

These choices have tradeoffs. I tend to favor defining fixed
size "cache blocks" and working around such with things like
segmented cache blocks. The diversity in cache block sizes
(and even page sizes) seems to be somewhat small, so
Architecting a sub-optimal size may not be too horrible. On the
other hand, I also like avoiding having hardware handle things
that could be done by software (which is one reason I like the
idea of an intermediate software distribution format which can
be translated to fit the implementation, providing the caching
of the optimization in the directly executable format while
allowing the distribution format to be more portable).
--- Synchronet 3.22a-Linux NewsLink 1.2

From Paul Clayton@[email protected] to comp.arch on Mon May 25 01:24:49 2026

From Newsgroup: comp.arch

On 5/13/26 9:36 PM, MitchAlsup wrote:

Paul Clayton <[email protected]> posted:

[big snip]

I suspect that HW TM will never take hold of the CPU industry.

I suspect you are correct. I still think an optimistic
concurrency mechanism with at least a very large read set could
be useful.

[snip]

(The issue I have with limited optimistic concurrency mechanisms
like AMD's Advanced Synchronization Facility and My 66000's
Exotic Synchronization Mechanism is not the initial limits but
that there seems to be little presentation of an interface that
can be extended.

For example:: what ??

Increasing the size of the transaction would not (I think)
require any new instructions but it would require documentation.
The Principles of Operation version that I have seems to imply
that Architecturally ESM transactions are limited to six 64-byte
chunks rather than that such is a minimum guaranteed by the
Architecture (like ASF does).

I imagined that there might be some use for providing a scope
identifier for transactions that is denser than a list of
addresses. This might be used to facilitate ordering
optimization and perhaps even false conflict avoidance (e.g., a
large read set might be guarded by a conservative filter based
on used addresses and by the transaction scope identifier or
false sharing within a cache block might be ignored). Obviously,
this would be dangerous, comparable to not acquiring a necessary
lock, if software got the scope naming wrong.

Exporting atomic operations also seems not to be discussed.
While such is technically not Architectural, there are other
quality of implementation matters that seem important for My
66000. Knowing that certain idioms will always be recognized
and optimized to at least a defined degree (which can vary
across time) is important.

Quoting Scott Lurndal
(Message-ID: <qQJPR.1144299$[email protected]>):
| Functionality guarantees, yes. Performance has to suffer,
| unless the hardware can analyze all the instructions between
| the LL/SC and abstract them into a single bus operation; which
| I don't see as feasible.

(The virtual vector method has similar issues of the performance
profile not being clear. Documenting that a vec loop will never
perform more than 2% worse than any semantically equivalent code
may not be sufficient if a different algorithm would perform
50% better on the same hardware and give an acceptable result.
This is not really Architectural, but benchmarking every
implementation seems unattractive. Even Intel's early frequency
throttling for AVX-512 made performance choices for software
developers more difficult than ideal. I know this is not a
trivial issue for hardware developers and computer architects,
but I think it deserves more attention than it seems to get.)

If my notion of exporting part of a transaction makes sense
(like atomically incrementing a counter if a complex transaction
succeeds), how this should be expressed would need to be
documented (the same way zeroing and nop idioms are defined not
because of semantics but because of performance).

Since copying of a page can be atomic (at least for I/O?), it
seems to me that this could be integrated with ESM.

Other possible future developments may not belong in an
Architecture defining document, but having some documentation
of developmental intent and mere possibilities seems
potentially beneficial.

There may also be interactions between ESM and thread scheduling
and power management. (I do not have even an early version of
the system documentation for power management and such.) Just as
Intel introduced the PAUSE instruction in part to save power
when waiting for a lock to be released, ESM might interact with
thread priority (which I think you mentioned before) and power
use.

There is also a similarity between ESM and x86's MONITOR/MWAIT.
(The former monitors interference hoping for none so an
operation can be atomic; the latter monitors interference hoping
for some so that another chunk of work can be started.) I do not
think My 66000 defined anything like MWAIT.

I would guess that others would have some ideas for ways that
ESM might be extended.

[snip]

Of course, just as early broad software
abstractions present the risk of choosing the wrong abstraction
from lack of experience, having too many exceptional cases, and
delaying release, an ISA can be designed with excessive
flexibility that is not exploited much later and has immediate
costs.)

That is the problem when you have only been working on it for 22 years----------------alone---------------without feedback

Sigh. Sometimes I wish I had more computer architecture
expertise. Even if I could not help develop a better synchronization/communication interface or mechanism, I might at
least contribute to the state of the art in some way.

[snip]

It never ceased to amaze me that Solaris would not boot without a
real TLB in the simulator. Just referencing all the right memory
where the tables were stored (using the CR holding said pointer)
was not enough--you had to have a TLB with at least 5 FA entries.

I wish all the experience of people like you was gathered
together for future generations. Yes, the Computer History
Museum has a lot of oral histories, and a.f.c. and other parts
of USENET are probably archived, but it seems a lot of lore
is lost.

[snip]

Mitch considers TM to be a SW problem and My 66000 ISA supports SW
by allowing multiple lines to participate in a TM transaction,
without over constraining how SW gets its job done, and with enough
HW defined behavior that SW can make a robust system with it. Other
than that TM is a SW problem.

I agree that general transactional memory is a software problem,
but I think a lot of aspects can be assisted by hardware. E.g., a
conservative filter of read addresses is harder to do in
software. Read-Copy-Update methods (which seem to present a
limited form of versioned memory) may also be amenable to
hardware assistance of some kind.

Cliff Click's "IWannaBit!" (2008) opens with:
| Just One Lousy Bit! I want to know if any memory operation
| misses or any line in my L1 cache gets evicted. Why? Because
| with this one Bit I can write any number of lock-free
| algorithms easily. This Bit gives me an N-word atomic read
| set, and with a typical Store Conditional instruction a 1-word
| atomic write set. The algorithm writing community has begged
| for D-CAS or Hardware Transactional Memory for years, but
| proposals far out-strip implementations: neither are available
| on any commodity system. With this Bit I hope to lower the
| hardware costs as low as possible while still being useful.

That proposal was in my opinion too small in that it failed a
transaction on any cache miss (so the cache had to be warmed up
before a transaction could succeed). At minimum the cache block
of the starting instruction could be a non-failing cache miss,
allowing fast single-block atomics. Yet it is more powerful
than ESM in one very limited way: the capacity of the read set
can be much larger.
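
A toy Python model of how I read the one-bit idea, to show the warm-up issue. All names are invented, the cache is a dict, and the rule that any miss or eviction sets the bit is taken from the quoted description, not from any implementation.

```python
# Toy model of a single "something missed or was evicted" bit.
# BitCore and atomic_sum are hypothetical names for illustration.

class BitCore:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}        # addr -> value, the toy L1
        self.bit = False       # set on any miss or eviction

    def begin(self):
        self.bit = False

    def load(self, addr):
        if addr not in self.cache:
            self.bit = True                       # miss dooms the attempt
            self.cache[addr] = self.memory[addr]  # but still warms the cache
        return self.cache[addr]

    def evict(self, addr):
        # Stand-in for a snoop or capacity eviction of one line.
        if self.cache.pop(addr, None) is not None:
            self.bit = True

    def store_conditional(self, addr, value):
        if self.bit:
            return False
        self.memory[addr] = value
        self.cache[addr] = value
        return True


def atomic_sum(core, addrs, dest):
    """Sum the words at addrs into dest; return (total, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        core.begin()
        total = sum(core.load(a) for a in addrs)
        if core.store_conditional(dest, total):
            return total, attempts
```

Starting from a cold cache, atomic_sum needs two attempts: the first only warms the cache, the second succeeds. The read set is bounded only by what stays resident in the toy cache.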

As a side note, Cliff Click worked at Azul Systems, which
sold a Java-targeted processor that supported transactional
memory. A lot of software work was required to take software
counters out of atomic sections because such produced
interference from the counter being shared, but a lot of
software was not changed and that hurt the performance of
transactional memory. With locks, atomic performance counters
are nearly free; with transactional memory, this design choice
was sub-optimal.
--- Synchronet 3.22a-Linux NewsLink 1.2
