• Re: Should an ISA contain

    From Chris M. Thomasson@[email protected] to comp.arch on Fri May 15 14:13:32 2026
    From Newsgroup: comp.arch

    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say “self-modifying code” as soon as I
    mentioned this, even though the two are quite different things.

    I think it’s quite desirable that an architecture guarantees that an
    (to coin a phrase) “instruction view” versus a “data view” of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs, and in ways where the performance cost of the coherence mechanisms tends to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.


    Say, typical scenario with incoherent caches:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache sends a copy to B.
    A and B now have incoherent copies.
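
    The incoherent scenario above can be sketched as a toy model. This is only an illustration: the names (`L2`, `request_for_write`, `write_back`) are hypothetical, and whole-line write-back without per-byte masks is an assumption, not something the post specifies.

```python
# Toy model: an L2 holding one line, and two cores that each receive a
# private copy with no coherence protocol between them.
class L2:
    def __init__(self, line):
        self.line = list(line)

    def request_for_write(self):
        # Incoherent: always hand out a copy, never track who holds it.
        return list(self.line)

    def write_back(self, copy):
        # Assumption: whole-line write-back, so the last writer wins.
        self.line = list(copy)


l2 = L2([0, 0])
copy_a = l2.request_for_write()   # Core A requests line (for write)
copy_b = l2.request_for_write()   # Core B requests line (also for write)

copy_a[0] = 1                     # Core A updates word 0
copy_b[1] = 2                     # Core B updates word 1
incoherent = copy_a != copy_b     # A and B now disagree about the line

l2.write_back(copy_a)
l2.write_back(copy_b)             # B's stale word 0 overwrites A's update
print(incoherent, l2.line)        # True [0, 2]
```

    Under these assumptions Core A's update is silently lost, which is the hazard the later schemes try to avoid.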

    Versus Say:
      Core A Requests Line (for Write);
      Core B Requests Line (also for Write);
      L2 Cache sends a copy to A;
      L2 Cache rejects B's Request;
      L2 Cache sends a request to A to write line back;
      Core A writes line back (flushing it locally);
      (Maybe) L2 signals to Core B that the line is now available.
      Core B Requests Line again (retry);
      L2 Cache sends a copy to B.


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
      Core A requests a line for Volatile Write;
      Core B Requests Line (also for Volatile Write);
      L2 Cache sends a copy to A;
      L2 Cache ignores B's Request (it can cycle the ring some more);
        L2 cache can track volatile lines and see that it is in-use.
      Core A writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.
      L2 Cache sends a copy to B
        Via the original request cycling around and hitting L2 again
      Core B writes back line and flushes local copy;
        L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately, this means that there is a performance penalty, but these accesses can remain coherent (but without the impact of trying to make all memory coherent).
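
    A rough sketch of the volatile sequence described above, for a single line. The class and function names are hypothetical, and "ignoring" a request is modeled as returning `None` so that the requester simply asks again (standing in for the request cycling the ring).

```python
class VolatileL2:
    def __init__(self, value=0):
        self.value = value
        self.in_use = False       # is a volatile access outstanding?

    def request_volatile(self):
        if self.in_use:
            return None           # ignored; the request keeps cycling
        self.in_use = True
        return self.value

    def write_back(self, value):
        self.value = value
        self.in_use = False       # volatile access marked complete


def volatile_increment_steps(l2):
    """One core's volatile read-modify-write, one step per yield."""
    copy = l2.request_volatile()
    while copy is None:
        yield "retry"
        copy = l2.request_volatile()
    yield "got line"
    l2.write_back(copy + 1)       # write back and drop the local copy at once
    yield "done"


l2 = VolatileL2()
a = volatile_increment_steps(l2)
b = volatile_increment_steps(l2)
trace = [next(a), next(b), next(a), next(b), next(b)]
print(trace, l2.value)
```

    In this model Core B only receives the line after Core A has written it back, so both increments survive, at the price of B waiting.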


    For something like an inter-processor JIT, this would alas still require flushing the L1 caches in a way that is coordinated between threads.

    Normally, the mutex mechanism does not include I$ flushes, though one possibility could be to have, say, a separate JIT mutex lock, where if threads (upon trying to lock a mutex) see a JIT Sequence Number that
    does not match the expected value for that mutex on that processor core,
    it also triggers an I$ flush.

    Say:
      JIT Lock:
        Flush Caches;
        Lock Mutex;
        Increment JIT Sequence Number (JSN).
      Do stuff;
      Flush Caches;
      Unlock Mutex;
        Flush Caches;
        Set mutex to unlocked.





      Lock Mutex (Normal):
        Flush Caches;


    Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire, and #LoadStore | #StoreStore for release. No #StoreLoad ordering.

        Lock Mutex;
        Check JSN against core's current JSN;
          If mismatch, flush I$ and update core's JSN.
          Likely all via CPUID and a lookup table, not new arch.
      Do Stuff;
      Unlock Mutex:
        ...
      ...
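
    The JIT Sequence Number idea in the pseudocode above might look roughly like this as a single-threaded model. Everything here is a sketch under assumptions: `Core`, `JitMutex`, and `flush_icache` are hypothetical names, data-cache flushes are left out, and only one mutex is modeled rather than a per-mutex, per-core table.

```python
class Core:
    def __init__(self):
        self.jsn = 0              # last JIT sequence number this core saw
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1


class JitMutex:
    def __init__(self):
        self.locked = False
        self.jsn = 0              # bumped each time JIT code may change

    def jit_lock(self):
        assert not self.locked    # single-threaded model: no contention
        self.locked = True
        self.jsn += 1             # announce that code is about to change

    def lock(self, core):
        assert not self.locked
        self.locked = True
        if core.jsn != self.jsn:  # code changed since this core last looked
            core.flush_icache()
            core.jsn = self.jsn

    def unlock(self):
        self.locked = False


m = JitMutex()
runner = Core()
m.lock(runner); m.unlock()        # no JIT activity yet: no flush
m.jit_lock(); m.unlock()          # some other core emits new code
m.lock(runner); m.unlock()        # mismatch: runner flushes its I$
m.lock(runner); m.unlock()        # JSN already current: no second flush
print(runner.icache_flushes)      # 1
```

    The point of the counter is that cores which never observe a changed JSN never pay for an I$ flush on an ordinary lock.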



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Fri May 15 13:11:02 2026
    From Newsgroup: comp.arch

    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.
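
    As a sketch of the bookkeeping this would involve, here is a toy load queue that issues loads early and replays any load whose address is written by another core before the load commits. The names are hypothetical, and the model conservatively replays on any store to the address rather than checking whether the value actually changed.

```python
class SpeculativeLoadQueue:
    def __init__(self, memory):
        self.memory = memory
        self.pending = {}         # load id -> [address, value, squashed?]
        self.replays = 0

    def issue(self, load_id, addr):
        # Send the load out early, before older stores have left the core.
        self.pending[load_id] = [addr, self.memory[addr], False]

    def snoop_store(self, addr, value):
        # Another core's store becomes visible: any uncommitted load that
        # already read this address may hold a value it should not have seen.
        self.memory[addr] = value
        for entry in self.pending.values():
            if entry[0] == addr:
                entry[2] = True

    def commit(self, load_id):
        addr, value, squashed = self.pending.pop(load_id)
        if squashed:
            self.replays += 1     # real hardware would also squash dependents
            value = self.memory[addr]
        return value


mem = {0x40: 1, 0x80: 7}
q = SpeculativeLoadQueue(mem)
q.issue("ld1", 0x40)
q.issue("ld2", 0x80)
q.snoop_store(0x40, 2)            # remote store hits ld1's address
results = (q.commit("ld1"), q.commit("ld2"))
print(results, q.replays)         # (2, 7) 1
```

    The cost mentioned above shows up as the `pending` table: every early load has to stay tracked until it commits.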


    === Stefan
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat May 16 05:57:47 2026
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.
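
    The arithmetic behind those figures can be reproduced as follows. The 90ns latency and the 448-entry ROB come from the post; the 5.5 GHz clock and 8 instruction slots per cycle are assumptions chosen to match "about 500 cycles" and "4000 potential instruction slots".

```python
latency_ns = 90          # inter-core latency used in the post
clock_ghz = 5.5          # assumption: near the upper end of Zen 5 clocks
slots_per_cycle = 8      # assumption: implied by 4000 slots / 500 cycles
rob_entries = 448        # Zen 5 ROB size quoted in the post

cycles = latency_ns * clock_ghz            # ns * cycles/ns
slots = cycles * slots_per_cycle
rob_multiple = slots / rob_entries         # how many ROBs' worth of slots

print(round(cycles), round(slots), round(rob_multiple, 1))   # 495 3960 8.8
```

    With these assumed numbers, covering one communication latency would take close to nine times the existing ROB capacity.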

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Concerning performance costs, whenever a conflict is detected, one way
    of recovery would be to reset all cores to the architectural state of
    the last snapshot before the conflict happened. One can probably find
    less draconic ways to ensure consistency, but I consider them to be optimizations. One optimization might be to predict the conflict and
    hold back the corresponding load such that no conflict happens and no
    reset is necessary. Another might be to find out which cores
    communicate, and only reset those that have talked to each other since
    the snapshot.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
  • From MitchAlsup@[email protected] to comp.arch on Sat May 16 18:04:55 2026
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    Stefan Monnier <[email protected]> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the SC off as
    a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely
    recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually
    Consistent machine.

    Concerning performance costs, whenever a conflict is detected, one way
    of recovery would be to reset all cores to the architectural state of
    the last snapshot before the conflict happened.

    Just broadcasting that it needs to be recovered on a multi-chip multi-processor is going to take on the order of 1000 instructions (200 ns).

    One can probably find
    less draconic ways to ensure consistency, but I consider them to be optimizations. One optimization might be to predict the conflict and
    hold back the corresponding load such that no conflict happens and no
    reset is necessary.

    Hardwiring ST-to-LD 'pin' ordering is, in effect, Sequential Consistency.

    Another might be to find out which cores
    communicate, and only reset those that have talked to each other since
    the snapshot.

    Given 256 cores across 4-chips, this represents 256,000 instructions
    of recovery buffering. ... So, I doubt this is practicable even if
    feasible.


    - anton
  • From Terje Mathisen@[email protected] to comp.arch on Sat May 16 20:50:49 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:

    Stefan Monnier <[email protected]> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the SC off as
    a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually
    Consistent machine.

    Casually -> Causally ?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From MitchAlsup@[email protected] to comp.arch on Sat May 16 23:01:00 2026
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:

    MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:

    Stefan Monnier <[email protected]> writes:
    Consider a GBOoO machine under sequential consistency, a LD which
    can have its address calculated early cannot leave the CPU area
    until all older stores currently in flight have left the CPU area.
    This would dramatically add to L1 cache miss latency, and would
    add moderately to L1 cache hit latency.

    Can't the GBOoO send the LD out early/speculatively, and do a kind of
    branch-recovery if that memory location is later modified in a way that
    changes what the LD should have received?

    Of course it can, although I would not call it "branch" recovery.

    The person you cited without attribution (to protect the guilty?)
    exhibits what I called the laziness of hardware designers: Instead of
    thinking how to implement sequential consistency efficiently, they
    think about rationalizations for not doing so.

    Of course, that too comes with a cost (that of keeping track of all
    those memory accesses that may have to be re-done), but it's not obvious
    to me that it would necessarily be impractical.

    Consider the case where the speculative LD is interfering with another CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the SC off as a failure. How does one recover that ??

    So, yes, you can recover this CPU's state, but no, you cannot precisely recover the other CPU's state.

    Yes, the whole architectural state of the core would have to be reset.
    The major challenge for using the classical implementation of
    speculative execution (with register renaming, speculative store
    buffer, and reorder buffer) is the worst-case latency of inter-core
    communication. E.g., I see at
    <https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
    the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
    (multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
    newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
    range, and I expect that if an architecture provides sequential
    consistency, there are more incentives to bring that latency number
    down. OTOH, with multi-socket machines, the latency tends to be
    higher. Anyway, let's work with the 90ns number. That's about 500
    cycles at the higher Zen5 clock rates, and is 4000 potential
    instruction slots; the Zen5 ROB only has 448 entries, so one probably
    will not extend the ROB approach to deal with sequential consistency.
    A snapshot-and-recovery mechanism might work, based on epochs on the
    order of the maximum communication latency.

    And that only recovers the state, not the intent of the state (above).

    Then we have to think about how to prevent (not mitigate) Spectre for
    such a mechanism; yes, hardware designers currently don't do anything
    about preventing Spectre, and they probably will not do anything if
    they ever implement sequential consistency, but I think they should,
    and so I also think that one needs a way to implement sequential
    consistency efficiently that can be combined with an efficient
    prevention of Spectre. Note how speculative side channel attacks were
    the final death sentence for TSX.

    Given that ST to LD ordering is an inherent part of SC, a SC machine
    will not be able to use as large an execution window as a Casually Consistent machine.

    Casually -> Causally ?

    Friggen spelling corrector.....

    Terje


  • From BGB@[email protected] to comp.arch on Sun May 17 12:16:04 2026
    From Newsgroup: comp.arch

    On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say “self-modifying code” as soon as I
    mentioned this, even though the two are quite different things.

    I think it’s quite desirable that an architecture guarantees that an
    (to coin a phrase) “instruction view” versus a “data view” of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require explicit
    flushing and invalidation. Or, say, to have caches that are implicitly
    coherent between threads (Core A stores to a location, Core B loads
    from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs, and in ways where the performance cost of the coherence
    mechanisms tends to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends on
    the cache-coherent behavior, and much more than 2 or 4 cores, it is
    not difficult to imagine scenarios where waiting on cache-coherence
    mechanisms becomes a more significant cost than actual memory-transfer
    bandwidth on the bus.


    Say, typical scenario with incoherent caches:
       Core A Requests Line (for Write);
       Core B Requests Line (also for Write);
       L2 Cache sends a copy to A;
       L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Versus Say:
       Core A Requests Line (for Write);
       Core B Requests Line (also for Write);
       L2 Cache sends a copy to A;
       L2 Cache rejects B's Request;
       L2 Cache sends a request to A to write line back;
       Core A writes line back (flushing it locally);
       (Maybe) L2 signals to Core B that the line is now available.
       Core B Requests Line again (retry);
       L2 Cache sends a copy to B.


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
       Core A requests a line for Volatile Write;
       Core B Requests Line (also for Volatile Write);
       L2 Cache sends a copy to A;
       L2 Cache ignores B's Request (it can cycle the ring some more);
         L2 cache can track volatile lines and see that it is in-use.
       Core A writes back line and flushes local copy;
         L2 cache then marks the volatile access as complete.
       L2 Cache sends a copy to B
         Via the original request cycling around and hitting L2 again
       Core B writes back line and flushes local copy;
         L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately,
    this means that there is a performance penalty, but these accesses can
    remain coherent (but without the impact of trying to make all memory
    coherent).


    For something like an inter-processor JIT, this would alas still
    require flushing the L1 caches in a way that is coordinated between
    threads.

    Normally, the mutex mechanism does not include I$ flushes, though one
    possibility could be to have, say, a separate JIT mutex lock, where if
    threads (upon trying to lock a mutex) see a JIT Sequence Number that
    does not match the expected value for that mutex on that processor
    core, it also triggers an I$ flush.

    Say:
       JIT Lock:
         Flush Caches;
         Lock Mutex;
         Increment JIT Sequence Number (JSN).
       Do stuff;
       Flush Caches;
       Unlock Mutex;
         Flush Caches;
         Set mutex to unlocked.





       Lock Mutex (Normal):
         Flush Caches;


    Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire, and #LoadStore | #StoreStore for release. No #StoreLoad ordering.


    Cache Flushing on Mutex Lock:
    Anything that was in-memory is now written back;
    Cache is ready to accept new (non-stale) data.

    Cache Flush on Mutex Unlock:
    Anything dirty in cache during time mutex was held is now written back;
    ...

    This causes mutex lock/unlock to become a sort of memory ordering event.
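
    A minimal model of that behavior, assuming private write-back caches with no coherence between cores. The `Core` class, the dictionary-based mutex, and the single-threaded call order are all illustrative assumptions, not a description of the actual hardware.

```python
class Core:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}           # addr -> value (private, not coherent)
        self.dirty = set()

    def load(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        for addr in self.dirty:   # write back everything dirty...
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()
        self.cache.clear()        # ...and drop possibly stale lines

    def lock(self, mutex):
        self.flush()
        assert not mutex["held"]  # single-threaded model: no contention
        mutex["held"] = True

    def unlock(self, mutex):
        self.flush()
        mutex["held"] = False


memory = {"x": 0}
mutex = {"held": False}
a, b = Core(memory), Core(memory)

stale = b.load("x")               # B caches x == 0 outside the lock
a.lock(mutex); a.store("x", 42); a.unlock(mutex)
unsynchronized = b.load("x")      # still 0: B has not flushed
b.lock(mutex); synchronized = b.load("x"); b.unlock(mutex)
print(stale, unsynchronized, synchronized)   # 0 0 42
```

    In this model a core sees another core's stores only across a lock/unlock pair, which is what makes the mutex act as the ordering event.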


    It is sort of needed for a weak model to work for multi-core
    multi-threading and not just end up exploding (and some practices will
    still not work as they would on a core with stronger memory ordering and
    cache coherence).

    Can skip the flushing though in cases where a mutex is only being
    used from a single core (since memory is coherent within a core).


         Lock Mutex;
         Check JSN against core's current JSN;
           If mismatch, flush I$ and update core's JSN.
           Likely all via CPUID and a lookup table, not new arch.
       Do Stuff;
       Unlock Mutex:
         ...
       ...




  • From MitchAlsup@[email protected] to comp.arch on Sun May 17 22:08:08 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
    -------------------Why can't BGB clip unnecessary lines in the thread ???
    Cache Flushing on Mutex Lock:
    Anything that was in-memory is now written back;
    Cache is ready to accept new (non-stale) data.

    Cache Flush on Mutex Unlock:
    Anything dirty in cache during time mutex was held is now written back;
    ...

    This causes mutex lock/unlock to become a sort of memory ordering event.

    This is the means by which My 66000 presents all participating cache
    lines {as before or as after} in a single instant to the rest of the
    system. Here, the trigger is the STL (equivalent to SC) and a check for
    no deleterious interference.

    It is sort of needed for a weak model to work for multi-core
    multi-threading and not just end up exploding (and some practices will
    still not work as they would on a core with stronger memory ordering and cache coherence).

    At the start of an ATOMIC event My 66000 reverts to sequential consistency.
    At the termination My 66000 reverts to causal consistency.

  • From Chris M. Thomasson@[email protected] to comp.arch on Mon May 18 17:14:52 2026
    From Newsgroup: comp.arch

    On 5/17/2026 10:16 AM, BGB wrote:
    On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
    On 5/14/2026 10:22 AM, BGB wrote:
    On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
    On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory.

    Also the Motorola 68040.

    Upon updating instructions (e.g., from a JIT compiler), they require
    that the modifying thread(s) write the lines back from the data
    cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on
    the executing threads are extremely difficult to meet if the
    executing threads run independently of the modifying thread(s). Or,
    in short, IA-32 and AMD64 did the right architecture for that.

    One technique for implementing lexical binding and functions as
    first-class objects involves generating code at run-time. Some people
    would immediately gasp and say “self-modifying code” as soon as I
    mentioned this, even though the two are quite different things.

    I think it’s quite desirable that an architecture guarantees that an
    (to coin a phrase) “instruction view” versus a “data view” of the same
    memory location will never show different values.

    Sometimes, there is a difference between "nice to have", vs "cost
    effective".

    It is nicer, say, to have I$/D$ coherence, and to not require
    explicit flushing and invalidation. Or, say, to have caches that are
    implicitly coherent between threads (Core A stores to a location,
    Core B loads from that location, Core B sees what Core A stored).


    The requirements to pull all this off in practice may add significant
    costs; and also in ways where the performance cost of the coherence
    mechanisms tends to scale upwards as core counts increase.

    Say, for example, if one has coherent caches, software that depends
    on the cache-coherent behavior, and much more than 2 or 4 cores, it
    is not difficult to imagine scenarios where waiting on cache-
    coherence mechanisms becomes a more significant cost than actual
    memory-transfer bandwidth on the bus.


    Say, typical scenario with incoherent caches:
       Core A Requests Line (for Write);
       Core B Requests Line (also for Write);
       L2 Cache sends a copy to A;
       L2 Cache sends a copy to B.
    A and B now have incoherent copies.

    Versus Say:
       Core A Requests Line (for Write);
       Core B Requests Line (also for Write);
       L2 Cache sends a copy to A;
       L2 Cache rejects B's Request;
       L2 Cache sends a request to A to write line back;
       Core A writes line back (flushing it locally);
       (Maybe) L2 signals to Core B that the line is now available.
       Core B Requests Line again (retry);
       L2 Cache sends a copy to B.
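
    A minimal sketch of the two request sequences above, written as a toy
    Python model rather than anything taken from a real design. The `L2` and
    `Core` classes and the `run` helper are hypothetical names. The model
    holds a single line, serializes the two requests, and has each core
    increment the line once, so a final value of 1 means an update was lost.

```python
class L2:
    """Toy L2 holding one line and handing copies to L1 caches."""

    def __init__(self, value, coherent):
        self.value = value
        self.coherent = coherent
        self.owner = None  # core currently holding the line for write

    def request_for_write(self, core):
        if self.coherent and self.owner is not None and self.owner is not core:
            # Coherent case: reject, and make the current owner write back.
            self.owner.write_back(self)
            return False  # requester has to retry
        self.owner = core
        core.line = self.value  # send a copy
        return True


class Core:
    def __init__(self, name):
        self.name = name
        self.line = None  # local copy, None when not cached

    def write_back(self, l2):
        l2.value = self.line
        self.line = None  # flush the local copy
        l2.owner = None

    def acquire(self, l2):
        tries = 1
        while not l2.request_for_write(self):
            tries += 1
        return tries


def run(coherent):
    """Return (final value in L2, number of requests core B needed)."""
    l2 = L2(value=0, coherent=coherent)
    a, b = Core("A"), Core("B")
    a.acquire(l2)
    a.line += 1
    tries_b = b.acquire(l2)
    b.line += 1
    b.write_back(l2)
    if a.line is not None:  # only still cached in the incoherent case
        a.write_back(l2)
    return l2.value, tries_b
```

    Under these assumptions, `run(False)` returns `(1, 1)`: B received a
    stale copy and one increment is lost. `run(True)` returns `(2, 2)`: B
    had to retry once, and both increments survive.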


    In my approach, I went with incoherent caches, but with a special
    Volatile mechanism for some cases, say:
       Core A requests a line for Volatile Write;
       Core B Requests Line (also for Volatile Write);
       L2 Cache sends a copy to A;
       L2 Cache ignores B's Request (it can cycle the ring some more);
         L2 cache can track volatile lines and see that it is in-use.
       Core A writes back line and flushes local copy;
         L2 cache then marks the volatile access as complete.
       L2 Cache sends a copy to B
         Via the original request cycling around and hitting L2 again
       Core B writes back line and flushes local copy;
         L2 cache then marks the volatile access as complete.

    Because volatile accesses flush the cached dirty lines immediately,
    this means that there is a performance penalty, but these accesses
    can remain coherent (but without the impact of trying to make all
    memory coherent).
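
    The volatile sequence above can be sketched the same way. This is a toy
    model under stated assumptions, not the actual implementation: the
    `VolatileL2` class and `run_volatile` function are hypothetical, the
    ring is a simple queue of pending requests, and each core performs one
    read-modify-write (an increment) and then flushes immediately.

```python
from collections import deque


class VolatileL2:
    """Toy L2 that tracks which volatile lines are checked out."""

    def __init__(self):
        self.mem = {}        # line address -> value
        self.in_use = set()  # volatile lines currently held by some core

    def try_grant(self, addr):
        """Return the line value, or None if the request must keep cycling."""
        if addr in self.in_use:
            return None
        self.in_use.add(addr)
        return self.mem.get(addr, 0)

    def write_back(self, addr, value):
        self.mem[addr] = value
        self.in_use.discard(addr)  # volatile access is now complete


def run_volatile(cores, addr=0x100):
    """Each core increments the same line once via a volatile access."""
    l2 = VolatileL2()
    ring = deque(cores)  # requests circulating on the ring
    holder = None        # (core, value it was given)
    log = []
    while ring:
        core = ring.popleft()
        value = l2.try_grant(addr)
        if value is None:
            log.append(f"{core}: ignored")
            ring.append(core)  # request goes around the ring again
            done, old = holder  # meanwhile the holder finishes and flushes
            l2.write_back(addr, old + 1)
            log.append(f"{done}: wrote back")
            holder = None
        else:
            log.append(f"{core}: granted")
            holder = (core, value)
    if holder is not None:
        done, old = holder
        l2.write_back(addr, old + 1)
        log.append(f"{done}: wrote back")
    return l2.mem[addr], log
```

    With two cores, `run_volatile(["A", "B"])` ends with the line holding 2:
    B's request is ignored once, then granted after A's write-back, so no
    update is lost even though nothing else in the model is coherent.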


    For something like an inter-processor JIT, this would alas still
    require flushing the L1 caches in a way that is coordinated between
    threads.

    Normally, the mutex mechanism does not include I$ flushes, though one
    possibility could be to have, say, a separate JIT mutex lock, where
    if threads (upon trying to lock a mutex) see a JIT Sequence Number
    that does not match the expected value for that mutex on that
    processor core, it also triggers an I$ flush.

    Say:
       JIT Lock:
         Flush Caches;
         Lock Mutex;
         Increment JIT Sequence Number (JSN).
       Do stuff;
       Flush Caches;
       Unlock Mutex;
         Flush Caches;
         Set mutex to unlocked.





       Lock Mutex (Normal):
         Flush Caches;


    Huh? Mutex lock/unlock only needs #LoadStore | #LoadLoad for acquire,
    and #LoadStore | #StoreStore for release. No #StoreLoad ordering.


    Cache Flushing on Mutex Lock:
      Anything that was in-memory is now written back;
      Cache is ready to accept new (non-stale) data.

    Cache Flush on Mutex Unlock:
      Anything dirty in cache during time mutex was held is now written back;
      ...

    This causes mutex lock/unlock to become a sort of memory ordering event.

    A mutex lock/unlock does not need #StoreLoad ordering. So, keep that in
    mind.





    It is sort of needed for a weak model to work for multi-core
    multi-threading and not just end up exploding (and some practices will
    still not work as they would on a core with stronger memory ordering and
    cache coherence).

    Can skip the flushing though in cases where a mutex is only being used
    from a single core (since memory is coherent within a core).


         Lock Mutex;
         Check JSN against core's current JSN;
           If mismatch, flush I$ and update core's JSN.
           Likely all via CPUID and a lookup table, not new arch.
       Do Stuff;
       Unlock Mutex:
         ...
       ...
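
    A rough sketch of the JIT Sequence Number idea, assuming one JSN per
    mutex and a per-core record of the last JSN seen. All names (`JitMutex`,
    `Core`, `lock_for_jit`, `flush_icache`) are hypothetical, the I$ flush
    is only counted rather than performed, and the D$ flushes from the
    outline are left out to keep the example short.

```python
import threading


class Core:
    """Stand-in for one processor core's view of the JSNs."""

    def __init__(self):
        self.seen_jsn = {}       # mutex -> last JSN this core observed
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1  # placeholder for a real I$ invalidate


class JitMutex:
    def __init__(self):
        self._lock = threading.Lock()
        self.jsn = 0  # JIT Sequence Number

    def lock_for_jit(self, core):
        """JIT Lock: take the mutex and announce that code is changing."""
        self._lock.acquire()
        self.jsn += 1

    def lock(self, core):
        """Normal lock: flush the I$ only if a JIT ran since we last looked."""
        self._lock.acquire()
        if core.seen_jsn.get(self, 0) != self.jsn:
            core.flush_icache()
            core.seen_jsn[self] = self.jsn

    def unlock(self):
        self._lock.release()
```

    In this sketch a core that never sees a new JSN never pays for an I$
    flush, and the JIT-ing core itself also flushes on its next normal lock
    because it has not yet recorded the incremented JSN.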





    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@[email protected] to comp.arch on Tue May 19 20:20:40 2026
    From Newsgroup: comp.arch

    On Fri, 08 May 2026 23:34:21 +0000, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission from
    the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Should an ISA contain an instruction that invalidates (without writing
    back) a Data Cache (or L2) line ?? {Discard}

    I am not sure how to answer such a question.

    After all, programs are written in advance of their being given to a
    computer to execute. So how do you even reference a cache line?

    The best you can do is point the instruction at a memory address, and
    say... _if_ the data at this address is cached, then either invalidate the copy in the cache, or allow the copy in the cache to be updated when this
    data is altered.

    Such instructions can exist. Do they belong in an ISA? Should they be privileged, since they address the system, or, since they're used to
    optimize code, are they hint instructions that ordinary programs need to
    have?

    Here, then, is where the answer to your question is found. Should an ISA
    have these instructions? Yes, _if_ the target machine is such that it
    needs this kind of hinting to help it gain the performance it is capable
    of producing.

    But this is such an obvious answer that you didn't need to ask if that was
    all that you could get; I can't give more.

    John Savard
  • From MitchAlsup@[email protected] to comp.arch on Tue May 19 21:04:40 2026
    From Newsgroup: comp.arch


    quadi <[email protected]d> posted:

    On Fri, 08 May 2026 23:34:21 +0000, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission from
    the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Should an ISA contain an instruction that invalidates (without writing back) a Data Cache (or L2) line ?? {Discard}

    I am not sure how to answer such a question.

    After all, programs are written in advance of their being given to a computer to execute. So how do you even reference a cache line?

    The best you can do is point the instruction at a memory address, and
    say... _if_ the data at this address is cached, then either invalidate the copy in the cache, or allow the copy in the cache to be updated when this data is altered.

    Such instructions can exist. Do they belong in an ISA? Should they be privileged, since they address the system, or, since they're used to optimize code, are they hint instructions that ordinary programs need to have?

    Privileged--likely not, especially when there are 4 (or more) layers of privilege.

    Should HyperVisor use them in management over SuperVisor?
    Should HyperVisor use them in management over user?
    Should SuperVisor use them in management over User ?
    Should user use them to optimize user stuff ?

    Here, then, is where the answer to your question is found. Should an ISA have these instructions? Yes, _if_ the target machine is such that it
    needs this kind of hinting to help it gain the performance it is capable
    of producing.

    But prefetching {a dramatically easier concept} has not been
    demonstrated to add much performance at all.

    But this is such an obvious answer that you didn't need to ask if that was all that you could get; I can't give more.

    John Savard
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Wed May 20 18:06:33 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    This is what RISC-V/ARM/etc. calls a Clean operation--write back the
    data if modified, but keep the line shared (maybe even exclusive).

    The main use for this is on systems with non-coherent instruction fetches
    or non-coherent DMA--it allows forcing data to the next level cache (or DDR) without having to remove it from the dcache, for modifying instructions or
    for updating non-coherent DMA buffers.

    You should not need it, but it shouldn't cost much to have. I'll note,
    though, that most architectures I've used have had unique bugs with
    Clean operations; they often interact in unexpected ways with other
    coherency traffic. Nothing hard to find or hard to fix, but it's a
    source of some bugs.


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    This is a Bad Idea(TM). You cannot let user-level code do this. I am
    not aware of any architecture that allows user-level code to do
    Data-Cache-Invalidate. Some have tried; all have had to retract it and
    at least make it a privileged operation.

    The security issue that's hard to fix is around OS calls. The OS can
    create new data for the user, and then the user can Invalidate it and
    peek behind, and see what secrets lie there. This includes things like
    mmap() and allocating new pages through any means, as well as read(),
    recv(), stat(), etc. This is such a leaky area you'll never plug all
    of the holes. You want the OS to not micromanage the cache, so you need
    to not allow the user to peek behind the cache. Yes, the user has write
    permissions, but the operation is not just a write, it's a
    go-around-the-cache-and-see-what's-in-memory, and that's quite dangerous.

    Saying it's "privileged only" just means the attack surface changes to the
    OS being able to attack the hypervisor (and so on). Historically, that was often considered acceptable, but I think the OS is less trusted now than 10 years ago.

    Note: if your architecture effectively does Invalidates on stack frames that are no longer valid, that might be fine, but it's still an attack surface.
    If user code can map the stack on top of OS-provided data, and then
    increment the stack pointer, and you then invalidate it, you've just
    given the user the same Invalidate operation and it can be used to
    attack the OS.

    I would think leaving the stack lines would be best--most code will make more calls, and then you don't have to refetch ownership.

    As for terminology, RISC-V and Arm basically agree: Invalidate means the
    cache entry is invalid at the end; Clean means any modified data is
    written back. So Clean-and-Invalidate is a common cache flush (modified
    data written back, entry ends up invalid).
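
    To make the three operations concrete, here is a toy model of a single
    cache line in front of a single memory location. It is only a sketch:
    the `Line` class and its method names are hypothetical, and the strings
    stand in for page contents. It also shows the peek-behind hazard
    described earlier: Invalidate alone discards modified data, so a later
    load sees whatever was in memory underneath.

```python
class Line:
    """One cache line backed by one memory location."""

    def __init__(self, memory_value):
        self.memory = memory_value  # contents of DRAM / next cache level
        self.cached = None          # cached value, None when invalid
        self.dirty = False

    def store(self, value):
        self.cached, self.dirty = value, True

    def load(self):
        if self.cached is None:     # miss: refill from memory
            self.cached, self.dirty = self.memory, False
        return self.cached

    def clean(self):
        """Write modified data back; the line stays valid."""
        if self.dirty:
            self.memory, self.dirty = self.cached, False

    def invalidate(self):
        """Drop the line; modified data is discarded, not written back."""
        self.cached, self.dirty = None, False

    def clean_and_invalidate(self):
        """The common 'flush': write back, then drop."""
        self.clean()
        self.invalidate()


def peek_behind(flush_properly):
    line = Line("old secret")       # stale contents left in memory
    line.store("zeros from OS")     # OS hands out a fresh page, still in cache
    if flush_properly:
        line.clean_and_invalidate()
    else:
        line.invalidate()           # the Discard operation under discussion
    return line.load()
```

    Here `peek_behind(False)` returns `"old secret"`, while
    `peek_behind(True)` returns `"zeros from OS"`.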

    You didn't ask, but Prefetch instructions are rarely useful. I know
    everyone has them, but what works is prefetch prediction hardware.
    Prefetch instructions can also have unique bugs since they don't act
    like regular LD/ST.

    Kent
  • From Paul Clayton@[email protected] to comp.arch on Wed May 20 20:27:26 2026
    From Newsgroup: comp.arch

    On 5/14/26 10:57 AM, Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 5/11/26 3:29 AM, Anton Ertl wrote:

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache, but AFAIK Linux actually zeros physical pages ahead of time
    typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    My 66000's memory set instruction is not limited to a page size
    defined when the instruction was generated. IBM's Data Cache
    Block Zero instruction had a compatibility problem when software
    written for early PowerPC caches was to be run on POWER (G5)
    with 128-byte cache blocks.

    ARM has a system register that software can access to determine
    the cache line size for the DC ZVA instruction.


    If one architecturally defines cache block size and page size,
    one is stuck working around that if a different size is better.

    Or, provide a mechanism for the software that performs
    the zeroing to determine both the cache and page sizes
    dynamically.

    That is an alternative, but it requires checking that
    information and acting accordingly. If software only wants to
    zero an aligned 32-byte section, knowing that the DCBZ
    instruction for the implementation on which the software is
    running has a 128-byte cache block is not very useful (other
    than avoiding wrong behavior); a sequence of zero stores would
    have to be used.
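
    As a sketch of what "checking that information and acting accordingly"
    might look like, the hypothetical `plan_zeroing` function below takes
    the block size as a parameter (standing in for whatever the runtime
    query reported) and emits block-zero operations only for fully covered,
    aligned blocks, falling back to ordinary zero stores elsewhere. The
    8-byte store size and the op names are assumptions made for
    illustration.

```python
def plan_zeroing(addr, length, block_size, store_size=8):
    """Return (op, address) pairs that zero the range [addr, addr + length).

    'block' stands for one block-zero instruction (DCBZ-like) and 'store'
    for one ordinary store of store_size zero bytes. The sketch assumes
    block_size is a multiple of store_size.
    """
    if addr % store_size or length % store_size:
        raise ValueError("sketch assumes store-size alignment")
    ops = []
    end = addr + length
    while addr < end:
        if addr % block_size == 0 and end - addr >= block_size:
            ops.append(("block", addr))   # whole aligned block fits
            addr += block_size
        else:
            ops.append(("store", addr))   # block zero would clear too much
            addr += store_size
    return ops
```

    Zeroing an aligned 32-byte section takes one block operation when the
    block size is 32, but four plain stores when it is 128, which is the
    case described above.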

    If the zeroing operations are limited to a single function or a
    single library — and the sizes are the same across the system —
    then patching all the function calls to an alternative function
    or loading a different library may be practical.

    Software developers seem often enough to assume such values are
    universal constants, so "working" software can become broken
    software. (This may also be related to the saying "There is
    nothing more permanent than a temporary fix.")

    These choices have tradeoffs. I tend to favor defining fixed
    size "cache blocks" and working around such with things like
    segmented cache blocks. The diversity in cache block sizes
    (and even page sizes) seems to be somewhat small, so
    Architecting a sub-optimal size may not be too horrible. On the
    other hand, I also like avoiding having hardware handle things
    that could be done by software (which is one reason I like the
    idea of an intermediate software distribution format which can
    be translated to fit the implementation, providing the caching
    of the optimization in the directly executable format while
    allowing the distribution format to be more portable).
  • From Paul Clayton@[email protected] to comp.arch on Mon May 25 01:24:49 2026
    From Newsgroup: comp.arch

    On 5/13/26 9:36 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    [big snip]
    I suspect that HW TM will never take hold of the CPU industry.

    I suspect you are correct. I still think an optimistic
    concurrency mechanism with at least a very large read set could
    be useful.

    [snip]
    (The issue I have with limited optimistic concurrency mechanisms
    like AMD's Advanced Synchronization Facility and My 66000's
    Exotic Synchronization Mechanism is not the initial limits but
    that there seems to be little presentation of an interface that
    can be extended.

    For example:: what ??

    Increasing the size of the transaction would not (I think)
    require any new instructions but it would require documentation.
    The Principles of Operation version that I have seems to imply
    that Architecturally ESM transactions are limited to six 64-byte
    chunks rather than that such is a minimum guaranteed by the
    Architecture (like ASF does).

    I imagined that there might be some use for providing a scope
    identifier for transactions that is denser than a list of
    addresses. This might be used to facilitate ordering
    optimization and perhaps even false conflict avoidance (e.g., a
    large read set might be guarded by a conservative filter based
    on used addresses and by the transaction scope identifier or
    false sharing within a cache block might be ignored). Obviously,
    this would be dangerous, comparable to not acquiring a necessary
    lock, if software got the scope naming wrong.

    Exporting atomic operations also seems not to be discussed.
    While such is technically not Architectural, there are other
    quality of implementation matters that seem important for My
    66000. Knowing that certain idioms will always be recognized
    and optimized to at least a defined degree (which can vary
    across time) is important.

    Quoting Scott Lurndal
    (Message-ID: <qQJPR.1144299$[email protected]>):
    | Functionality guarantees, yes. Performance has to suffer,
    | unless the hardware can analyze all the instructions between
    | the LL/SC and abstract them into a single bus operation; which
    | I don't see as feasible.

    (The virtual vector method has similar issues of the performance
    profile not being clear. Documenting that a vec loop will never
    perform more than 2% worse than any semantically equivalent code
    may not be sufficient if a different algorithm would perform
    50% better on the same hardware and give an acceptable result.
    This is not really Architectural, but benchmarking every
    implementation seems unattractive. Even Intel's early frequency
    throttling for AVX-512 made performance choices for software
    developers more difficult than ideal. I know this is not a
    trivial issue for hardware developers and computer architects,
    but I think it deserves more attention than it seems to get.)

    If my notion of exporting part of a transaction makes sense
    (like atomically incrementing a counter if a complex transaction
    succeeds), how this should be expressed would need to be
    documented (the same way zeroing and nop idioms are defined not
    because of semantics but because of performance).

    Since copying of a page can be atomic (at least for I/O?), it
    seems to me that this could be integrated with ESM.

    Other possible future developments may not belong in an
    Architecture defining document, but having some documentation
    of developmental intent and mere possibilities seems
    potentially beneficial.

    There may also be interactions between ESM and thread scheduling
    and power management. (I do not have even an early version of
    the system documentation for power management and such.) Just as
    Intel introduced the PAUSE instruction in part to save power
    when waiting for a lock to be released, ESM might interact with
    thread priority (which I think you mentioned before) and power
    use.

    There is also a similarity between ESM and x86's MONITOR/MWAIT.
    (The former monitors interference hoping for none so an
    operation can be atomic; the latter monitors interference hoping
    for some so that another chunk of work can be started.) I do not
    think My 66000 defined anything like MWAIT.

    I would guess that others would have some ideas for ways that
    ESM might be extended.

    [snip]
    Of course, just as early broad software
    abstractions present the risk of choosing the wrong abstraction
    from lack of experience, having too many exceptional cases, and
    delaying release, an ISA can be designed with excessive
    flexibility that is not exploited much later and has immediate
    costs.)

    That is the problem when you have only been working on it for 22 years----------------alone---------------without feedback

    Sigh. Sometimes I wish I had more computer architecture
    expertise. Even if I could not help develop a better synchronization/communication interface or mechanism, I might at
    least contribute to the state of the art in some way.

    [snip]
    It never ceased to amaze me that Solaris would not boot without a
    real TLB in the simulator. Just referencing all the right memory
    where the tables were stored (using the CR holding said pointer)
    was not enough--you had to have a TLB with at least 5 FA entries.

    I wish all the experience of people like you was gathered
    together for future generations. Yes, the Computer History
    Museum has a lot of oral histories, and a.f.c. and other parts
    of USENET are probably archived, but it seems a lot of lore
    is lost.

    [snip]
    Mitch considers TM to be a SW problem and My 66000 ISA supports SW
    by allowing multiple lines to participate in a TM transaction,
    without over constraining how SW gets its job done, and with enough
    HW defined behavior that SW can make a robust system with it. Other
    than that TM is a SW problem.

    I agree that general transactional memory is a software problem,
    but I think a lot of aspects can be assisted by hardware. E.g., a
    conservative filter of read addresses is harder to do in
    software. Read-Copy-Update methods (which seem to present a
    limited form of versioned memory) may also be amenable to
    hardware assistance of some kind.
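
    One way to picture a conservative filter of read addresses is a
    Bloom-style bit vector over cache-block addresses: it can report a
    conflict that is not real, but it never misses a block that was
    actually read. The sketch below is a software illustration only; the
    `ReadSetFilter` name, the 256-bit size, the three hash functions and
    the 64-byte block size are all assumptions, and hardware would use far
    cheaper hashes than SHA-256.

```python
import hashlib


class ReadSetFilter:
    """Bloom-style filter over the cache blocks in a transaction's read set."""

    def __init__(self, bits=256, hashes=3, block_bytes=64):
        self.bits = bits
        self.hashes = hashes
        self.block_bytes = block_bytes
        self.mask = 0  # the filter itself, one Python int used as a bit vector

    def _positions(self, addr):
        block = addr // self.block_bytes
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{block}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, addr):
        """Record that the block containing addr was read."""
        for p in self._positions(addr):
            self.mask |= 1 << p

    def may_conflict(self, addr):
        """True if a write to addr might hit the read set (never a false 'no')."""
        return all((self.mask >> p) & 1 for p in self._positions(addr))
```

    A remote write for which `may_conflict` returns False can be ignored
    safely; a True result only means the transaction has to be treated as
    possibly disturbed.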

    Cliff Click's "IWannaBit!" (2008) opens with:
    | Just One Lousy Bit! I want to know if any memory operation
    | misses or any line in my L1 cache gets evicted. Why? Because
    | with this one Bit I can write any number of lock-free
    | algorithms easily. This Bit gives me an N-word atomic read
    | set, and with a typical Store Conditional instruction a 1-word
    | atomic write set. The algorithm writing community has begged
    | for D-CAS or Hardware Transactional Memory for years, but
    | proposals far out-strip implementations: neither are available
    | on any commodity system. With this Bit I hope to lower the
    | hardware costs as low as possible while still being useful.

    That proposal was in my opinion too small in that it failed a
    transaction on any cache miss (so the cache had to be warmed up
    before a transaction could succeed). At minimum the cache block
    of the starting instruction could be a non-failing cache miss,
    allowing fast single-block atomics. Yet it is more powerful
    than ESM in one very limited way: the capacity of the read set
    can be much larger.
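
    The warm-up behavior can be seen in a toy model of the proposal. This
    is a sketch under simplifying assumptions, not Click's design: the
    `OneBitCache` class and `atomic_sum` function are hypothetical, each
    line holds one word, an ordinary load of the output word stands in for
    the load-linked, and a remote write is modeled as evicting the local
    copy.

```python
class OneBitCache:
    """Toy L1 with one sticky bit set on any miss or eviction."""

    def __init__(self, memory, capacity_lines=64):
        self.memory = memory        # dict: address -> word
        self.capacity = capacity_lines
        self.lines = {}             # resident address -> word
        self.disturbed = False      # the Bit

    def clear_bit(self):
        self.disturbed = False

    def load(self, addr):
        if addr not in self.lines:
            self.disturbed = True   # any miss sets the Bit
            if len(self.lines) >= self.capacity:
                self.lines.pop(next(iter(self.lines)))  # evict oldest line
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def remote_write(self, addr, value):
        """Another core writes addr, which evicts our copy if we have one."""
        self.memory[addr] = value
        if addr in self.lines:
            del self.lines[addr]
            self.disturbed = True

    def store_conditional(self, addr, value):
        if self.disturbed or addr not in self.lines:
            return False
        self.memory[addr] = value
        self.lines[addr] = value
        return True


def atomic_sum(cache, addrs, out_addr, max_tries=4):
    """Sum the words at addrs and publish the total, retrying if disturbed."""
    for attempt in range(1, max_tries + 1):
        cache.clear_bit()
        cache.load(out_addr)        # stand-in for the load-linked
        total = sum(cache.load(a) for a in addrs)
        if cache.store_conditional(out_addr, total):
            return total, attempt
    return None, max_tries
```

    Starting from a cold cache, the first attempt always fails on its own
    misses and the second succeeds, which is the warm-up cost noted above.
    The read set, on the other hand, is limited only by the cache capacity.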

    As a side note, Cliff Click worked at Azul Systems, which
    sold a Java-targeted processor that supported transactional
    memory. A lot of software work was required to take software
    counters out of atomic sections because such produced
    interference from the counter being shared, but a lot of
    software was not changed and that hurt the performance of
    transactional memory. With locks, atomic performance counters
    are nearly free; with transactional memory, this design choice
    was sub-optimal.