On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
There are "architectures" like Power where "data memory" and
"instruction memory" are not coherent, even when they are the same
memory.
Also the Motorola 68040.
Upon updating instructions (e.g., from a JIT compiler), they require
that the modifying thread(s) write the lines back from the data
cache to a shared cache or main memory, and that the executing
threads invalidate these cache lines and flush their pipeline. I
think that that's a bad idea, not just because it exposes
microarchitectural concepts like cache and pipeline to the
architecture, and leads to unpredictable results in some usage
scenarios (see my signature), but also because the requirements on
the executing threads are extremely difficult to meet if the
executing threads run independently of the modifying thread(s). Or,
in short, IA-32 and AMD64 did the right architecture for that.
One technique for implementing lexical binding and functions as
first-class objects involves generating code at run-time. Some people
would immediately gasp and say “self-modifying code” as soon as I
mentioned this, even though the two are quite different things.
I think it’s quite desirable that an architecture guarantees that an
(to coin a phrase) “instruction view” versus a “data view” of the same
memory location will never show different values.
Sometimes, there is a difference between "nice to have", vs "cost effective".
It is nicer, say, to have I$/D$ coherence, and to not require explicit flushing and invalidation. Or, say, to have caches that are implicitly coherent between threads (Core A stores to a location, Core B loads from that location, Core B sees what Core A stored).
The requirements to pull all this off in practice may add significant
costs, and in ways where the performance cost of the coherence mechanisms tends to scale upwards as core counts increase.
Say, for example, if one has coherent caches, software that depends on
the cache-coherent behavior, and much more than 2 or 4 cores, it is not difficult to imagine scenarios where waiting on cache-coherence
mechanisms becomes a more significant cost than actual memory-transfer bandwidth on the bus.
Say, typical scenario with incoherent caches:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache sends a copy to B.
A and B now have incoherent copies.
Versus Say:
Core A Requests Line (for Write);
Core B Requests Line (also for Write);
L2 Cache sends a copy to A;
L2 Cache rejects B's Request;
L2 Cache sends a request to A to write line back;
Core A writes line back (flushing it locally);
(Maybe) L2 signals to Core B that the line is now available.
Core B Requests Line again (retry);
L2 Cache sends a copy to B.
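The two request sequences above can be mimicked with a toy model. This is only a sketch under assumptions: the `L2` and `Core` classes, the single-line "memory", and the increment workload are all hypothetical stand-ins, not any real protocol. It shows why handing out two write copies at once loses an update, while forcing a write-back before the second grant does not.

```python
class L2:
    """Toy shared cache holding one line (a single integer)."""
    def __init__(self, coherent):
        self.value = 0
        self.coherent = coherent
        self.owner = None          # core currently holding the line for write

    def request_for_write(self, core):
        if self.coherent and self.owner is not None and self.owner is not core:
            # Reject-and-retry path, collapsed: the owner writes back first.
            self.owner.write_back(self)
        self.owner = core
        return self.value          # send a copy of the line


class Core:
    def __init__(self):
        self.line = None           # private copy, None when not cached

    def fetch(self, l2):
        self.line = l2.request_for_write(self)

    def increment(self):
        self.line += 1

    def write_back(self, l2):
        if self.line is not None:
            l2.value = self.line
            self.line = None       # flush the local copy
        if l2.owner is self:
            l2.owner = None


def run(coherent):
    l2, a, b = L2(coherent), Core(), Core()
    a.fetch(l2)                    # Core A requests the line for write
    a.increment()
    b.fetch(l2)                    # Core B requests the same line for write
    b.increment()
    a.write_back(l2)
    b.write_back(l2)
    return l2.value


print("incoherent:", run(False))   # one increment is lost
print("coherent:  ", run(True))    # both increments survive
```

The retry round trip is folded into a single call here; in hardware that extra traffic is exactly the cost being discussed.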
In my approach, I went with incoherent caches, but with a special
Volatile mechanism for some cases, say:
Core A requests a line for Volatile Write;
Core B Requests Line (also for Volatile Write);
L2 Cache sends a copy to A;
L2 Cache ignores B's Request (it can cycle the ring some more);
L2 cache can track volatile lines and see that it is in-use.
Core A writes back line and flushes local copy;
L2 cache then marks the volatile access as complete.
L2 Cache sends a copy to B
Via the original request cycling around and hitting L2 again
Core B writes back line and flushes local copy;
L2 cache then marks the volatile access as complete.
Because volatile accesses flush the cached dirty lines immediately, this means that there is a performance penalty, but these accesses can remain coherent (but without the impact of trying to make all memory coherent).
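A rough model of the volatile sequence above, assuming each core performs one read-modify-write (+1) on the same line. The function name, the `deque` standing in for the ring, and the trace strings are made up for illustration; the real mechanism lives in the L2 and ring logic, not in software.

```python
from collections import deque


def volatile_increments(cores):
    """Each core does one volatile read-modify-write (+1) on one line."""
    l2_value = 0
    holder = None                  # core the L2 has marked the line in use by
    held_copy = None               # that core's private copy
    ring = deque(cores)            # volatile-write requests circulating the ring
    trace = []
    while ring or holder is not None:
        if holder is None:
            # Line is free: L2 sends a copy and marks it in use.
            holder = ring.popleft()
            held_copy = l2_value
            trace.append(f"{holder} gets {held_copy}")
        else:
            if ring:
                # L2 ignores the competing request; it goes around again.
                ring.rotate(-1)
            # Holder updates, writes back, and flushes its local copy;
            # L2 then marks the volatile access as complete.
            l2_value = held_copy + 1
            trace.append(f"{holder} writes back {l2_value}")
            holder = None
    return l2_value, trace


value, trace = volatile_increments(["A", "B"])
print(value)
for step in trace:
    print(step)
```

Because the second request only gets a copy after the first write-back, no update is lost, at the price of serializing every volatile access.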
For something like an inter-processor JIT, this would alas still require flushing the L1 caches in a way that is coordinated between threads.
Normally, the mutex mechanism does not include I$ flushes, though one possibility could be to have, say, a separate JIT mutex lock, where if threads (upon trying to lock a mutex) see a JIT Sequence Number that
does not match the expected value for that mutex on that processor core,
it also triggers an I$ flush.
Say:
JIT Lock:
  Flush Caches;
  Lock Mutex;
  Increment JIT Sequence Number (JSN).
  Do stuff;
  Flush Caches;
  Unlock Mutex;
    Flush Caches;
    Set mutex to unlocked.
Lock Mutex (Normal):
  Flush Caches;
  Lock Mutex;
  Check JSN against core's current JSN;
    If mismatch, flush I$ and update core's JSN.
    Likely all via CPUID and a lookup table, not new arch.
  Do Stuff;
  Unlock Mutex:
    ...
...
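The JSN idea can be sketched in software. Everything here is an assumption for illustration: `Core`, `JitAwareMutex`, and the flush counter are hypothetical names, the D$ flushes from the outline are left out, and the per-core `seen_jsn` field stands in for the "CPUID and a lookup table" step.

```python
import threading


class Core:
    def __init__(self, name):
        self.name = name
        self.seen_jsn = 0          # last JIT sequence number this core observed
        self.icache_flushes = 0

    def flush_icache(self):
        self.icache_flushes += 1


class JitAwareMutex:
    def __init__(self):
        self._lock = threading.Lock()
        self.jsn = 0               # global JIT Sequence Number (JSN)

    def lock_for_jit(self, core):
        # "JIT Lock": take the mutex and bump the JSN to announce new code.
        self._lock.acquire()
        self.jsn += 1

    def lock(self, core):
        # "Lock Mutex (Normal)": compare the global JSN with the core's copy.
        self._lock.acquire()
        if core.seen_jsn != self.jsn:
            core.flush_icache()
            core.seen_jsn = self.jsn

    def unlock(self):
        self._lock.release()


m = JitAwareMutex()
writer, runner = Core("A"), Core("B")
m.lock(runner); m.unlock()             # no JIT activity yet: no I$ flush
m.lock_for_jit(writer); m.unlock()     # writer emits new code
m.lock(runner); m.unlock()             # runner notices the JSN mismatch
m.lock(runner); m.unlock()             # JSN unchanged: no second flush
print(runner.icache_flushes)
```

The point of the counter is that a core pays for an I$ flush only once per batch of JIT activity, not on every lock.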
Consider a GBOoO machine under sequential consistency, a LD which
can have its address calculated early cannot leave the CPU area
until all older stores currently in flight have left the CPU area.
This would dramatically add to L1 cache miss latency, and would
add moderately to L1 cache hit latency.
Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
Stefan Monnier <[email protected]> writes:
Can't the GBOoO send the LD out early/speculatively, and do a kind of
branch-recovery if that memory location is later modified in a way that
changes what the LD should have received?
Of course it can, although I would not call it "branch" recovery.
The person you cited without attribution (to protect the guilty?)
exhibits what I called the laziness of hardware designers: Instead of thinking how to implement sequential consistency efficiently, they
think about rationalizations for not doing so.
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
Yes, the whole architectural state of the core would have to be reset.
The major challenge for using the classical implementation of
speculative execution (with register renaming, speculative store
buffer, and reorder buffer) is the worst-case latency of inter-core
communication. E.g., I see at
<https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile> that
the Ryzen AI 9 HX 370 has latencies of 180ns<l<190ns, and the
(multi-die) 3950x has latencies of 80ns<l<90ns. I have read that with
newer firmware, the latency of Zen5-based CPUs gets down to the 80ns
range, and I expect that if an architecture provides sequential
consistency, there are more incentives to bring that latency number
down. OTOH, with multi-socket machines, the latency tends to be
higher. Anyway, let's work with the 90ns number. That's about 500
cycles at the higher Zen5 clock rates, and is 4000 potential
instruction slots; the Zen5 ROB only has 448 entries, so one probably
will not extend the ROB approach to deal with sequential consistency.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.
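The arithmetic behind those numbers, as a small sketch. The 90ns latency and the 448-entry ROB come from the text above; the 5.5 GHz clock is an assumption on my part, and the 8 slots per cycle is inferred from the ratio of the 4000-slot and 500-cycle figures.

```python
def speculation_window(latency_ns, clock_mhz, slots_per_cycle):
    """Return (cycles, instruction slots) covered by one inter-core round trip."""
    cycles = latency_ns * clock_mhz // 1000   # ns * MHz / 1000 = cycles
    return cycles, cycles * slots_per_cycle


ROB_ENTRIES = 448          # Zen 5 figure quoted above
# Assumptions for illustration: 5.5 GHz clock, 8 instruction slots per cycle.
cycles, slots = speculation_window(90, 5500, 8)
print(cycles, slots, round(slots / ROB_ENTRIES, 1))   # prints: 495 3960 8.8
```

So under these assumptions the window to cover is close to nine times the ROB capacity, which is why simply growing the ROB does not look attractive.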
Then we have to think about how to prevent (not mitigate) Spectre for
such a mechanism; yes, hardware designers currently don't do anything
about preventing Spectre, and they probably will not do anything if
they ever implement sequential consistency, but I think they should,
and so I also think that one needs a way to implement sequential
consistency efficiently that can be combined with an efficient
prevention of Spectre. Note how speculative side channel attacks were
the final death sentence for TSX.
Concerning performance costs, whenever a conflict is detected, one way
of recovery would be to reset all cores to the architectural state of
the last snapshot before the conflict happened.
One can probably find less draconian ways to ensure consistency, but I
consider them to be optimizations. One optimization might be to predict
the conflict and hold back the corresponding load such that no conflict
happens and no reset is necessary.
Another might be to find out which cores
communicate, and only reset those that have talked to each other since
the snapshot.
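One way to read that last optimization is as a reachability problem: starting from the core where the conflict was detected, follow every communication recorded since the snapshot. The sketch below takes that reading, which is my interpretation and not something spelled out above; `cores_to_reset` and the `(core_x, core_y)` message log are hypothetical.

```python
from collections import defaultdict, deque


def cores_to_reset(conflicting_core, messages):
    """Cores reachable from the conflicting core via communication since the snapshot.

    `messages` is an iterable of (core_x, core_y) pairs, treated as undirected.
    """
    neighbours = defaultdict(set)
    for x, y in messages:
        neighbours[x].add(y)
        neighbours[y].add(x)
    reset, queue = {conflicting_core}, deque([conflicting_core])
    while queue:
        core = queue.popleft()
        for other in neighbours[core] - reset:
            reset.add(other)
            queue.append(other)
    return reset


# Cores 0, 1 and 2 talked to each other since the snapshot; 3 and 4 only to each other.
print(sorted(cores_to_reset(0, [(0, 1), (1, 2), (3, 4)])))   # prints: [0, 1, 2]
```

Cores 3 and 4 keep running, which is the whole benefit over resetting every core.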
- anton
--- Synchronet 3.22a-Linux NewsLink 1.2
[email protected] (Anton Ertl) posted:
Stefan Monnier <[email protected]> writes:
Of course, that too comes with a cost (that of keeping track of all
those memory accesses that may have to be re-done), but it's not obvious
to me that it would necessarily be impractical.
Consider the case where the speculative LD is interfering with another CPU's ATOMIC LL/SC sequence, grabbing write permission, and sending the SC off as
a failure. How does one recover that?
So, yes, you can recover this CPU's state, but no, you cannot precisely recover the other CPU's state.
A snapshot-and-recovery mechanism might work, based on epochs on the
order of the maximum communication latency.
And that only recovers the state, not the intent of the state (above).
Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually
Consistent machine.
MitchAlsup wrote:
Given that ST to LD ordering is an inherent part of SC, a SC machine
will not be able to use as large an execution window as a Casually Consistent machine.
Casually -> Causally ?
Terje
On 5/14/2026 10:22 AM, BGB wrote:
Lock Mutex (Normal):
Flush Caches;
Huh? Mutex lock/unlock only need #LoadStore | #LoadLoad for acquire, and #LoadStore | #StoreStore for release. No #StoreLoad ordering.
On 5/15/2026 4:13 PM, Chris M. Thomasson wrote:
Why can't BGB clip unnecessary lines in the thread ???
On 5/14/2026 10:22 AM, BGB wrote:
On 5/13/2026 2:02 AM, Lawrence D’Oliveiro wrote:
On Mon, 11 May 2026 06:07:42 GMT, Anton Ertl wrote:
Cache Flushing on Mutex Lock:
Anything that was in-memory is now written back;
Cache is ready to accept new (non-stale) data.
Cache Flush on Mutex Unlock:
Anything dirty in cache during time mutex was held is now written back;
...
This causes mutex lock/unlock to become a sort of memory ordering event.
It is sort of needed for a weak model to work for multi-core
multi-threading and not just end up exploding (and some practices will
still not work as they would on a core with stronger memory ordering and cache coherence).
Can skip the flushing though in cases where a mutex is only being used
from a single core (since memory is coherent within a core).
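To make the flush-on-lock / flush-on-unlock argument concrete, here is a toy model of two cores with private write-back caches and no hardware coherence. It is a sketch under assumptions: `Memory`, `Core`, and the address 0x100 are hypothetical, the mutex itself is omitted because the two critical sections simply run one after the other, and "flush" means write back dirty lines and then drop the whole cache.

```python
class Memory:
    def __init__(self):
        self.data = {}


class Core:
    """Core with a private write-back cache that is not kept coherent."""
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}            # addr -> value
        self.dirty = set()

    def load(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory.data.get(addr, 0)
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Write back dirty lines, then drop everything so later loads refetch.
        for addr in self.dirty:
            self.memory.data[addr] = self.cache[addr]
        self.dirty.clear()
        self.cache.clear()


def critical_section(core, flush_on_lock_unlock):
    if flush_on_lock_unlock:
        core.flush()               # on lock: discard possibly stale lines
    core.store(0x100, core.load(0x100) + 1)
    if flush_on_lock_unlock:
        core.flush()               # on unlock: publish what was written


def run(flush):
    mem = Memory()
    a, b = Core(mem), Core(mem)
    b.load(0x100)                  # B caches the old value before A runs
    critical_section(a, flush)
    critical_section(b, flush)
    a.flush(); b.flush()           # final write-back so memory can be inspected
    return mem.data[0x100]


print("with flushes:   ", run(True))
print("without flushes:", run(False))
```

Without the flushes, B increments its stale copy and one update disappears; with them, the lock and unlock act as the ordering events described above.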
On Fri, 08 May 2026 23:34:21 +0000, MitchAlsup wrote:
Should an ISA contain an instruction that gives Write-Permission from
the Data Cache (or L2) line outward in the memory hierarchy,
while keeping the <now shared> line resident ?? {allow}
Should an ISA contain an instruction that invalidates (without writing back) a Data Cache (or L2) line ?? {Discard}
I am not sure how to answer such a question.
After all, programs are written in advance of their being given to a computer to execute. So how do you even reference a cache line?
The best you can do is point the instruction at a memory address, and
say... _if_ the data at this address is cached, then either invalidate the copy in the cache, or allow the copy in the cache to be updated when this data is altered.
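The "point it at an address, act only if that address is cached" semantics can be sketched as follows. `TinyCache` and its methods are hypothetical and model only the {Discard} case next to an ordinary write-back; the {allow} case is not modeled.

```python
class TinyCache:
    def __init__(self, backing):
        self.backing = backing     # dict standing in for memory
        self.lines = {}            # addr -> value, only for cached addresses

    def store(self, addr, value):
        self.lines[addr] = value   # write-back cache: memory not updated yet

    def discard(self, addr):
        """Invalidate without writing back; a no-op if addr is not cached."""
        return self.lines.pop(addr, None) is not None

    def write_back(self, addr):
        """Write the line back and drop it; a no-op if addr is not cached."""
        if addr in self.lines:
            self.backing[addr] = self.lines.pop(addr)
            return True
        return False


memory = {0x40: 1, 0x80: 1}
cache = TinyCache(memory)
cache.store(0x40, 99)
cache.store(0x80, 99)
cache.discard(0x40)      # the modified data is dropped
cache.write_back(0x80)   # the modified data reaches memory
cache.discard(0xC0)      # not cached: nothing happens
print(memory)            # prints: {64: 1, 128: 99}
```

The discarded store never reaches memory, which is one reason the question of privilege comes up for such an instruction.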
Such instructions can exist. Do they belong in an ISA? Should they be privileged, since they address the system, or, since they're used to optimize code, are they hint instructions that ordinary programs need to have?
Here, then, is where the answer to your question is found. Should an ISA have these instructions? Yes, _if_ the target machine is such that it
needs this kind of hinting to help it gain the performance it is capable
of producing.
But this is such an obvious answer that you didn't need to ask if that was all that you could get; I can't give more.
John Savard
Paul Clayton <[email protected]> writes:
On 5/11/26 3:29 AM, Anton Ertl wrote:
A better approach is to do just the writes. I think that zeroing the
page on demand is a good approach, because then it is already in the
D-cache, but AFAIK Linux actually zeros physical pages ahead of time
typically on a separate (otherwise idle) core, and just maps one of
those pages to the virtual page that needs to be written to. I wonder
why Linux does that.
Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
and the interconnect is designed to transport the page zero in one
transaction.
This is more flexible than having cache line and page clearing
instructions.
In what way is it more flexible? It is a page-clearing instruction.
My 66000's memory set instruction is not limited to a page size
defined when the instruction was generated. IBM's Data Cache
Block Zero instruction had a compatibility problem when software
written for early PowerPC caches was to be run on POWER (G5)
with 128-byte cache blocks.
ARM has a system register that software can access to determine
the cache line size for the DC ZVA instruction.
If one architecturally defines cache block size and page size,
one is stuck working around that if a different size is better.
Or, provide a mechanism for the software that performs
the zeroing to determine both the cache and page sizes
dynamically.
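The compatibility problem and the "query the size at run time" fix can both be shown with a small model. This is a sketch under assumptions: `block_zero` imitates a cache-block-zero instruction that clears the whole block containing the address, memory is a `bytearray`, and the function names and sizes are made up for illustration.

```python
def block_zero(memory, addr, hw_block):
    """Model of a cache-block-zero instruction: clears the whole block holding addr."""
    start = addr - addr % hw_block
    memory[start:start + hw_block] = bytes(hw_block)


def zero_hardcoded(memory, start, length, hw_block, assumed_block=32):
    """Steps by a block size baked in at build time."""
    for addr in range(start, start + length, assumed_block):
        block_zero(memory, addr, hw_block)


def zero_queried(memory, start, length, hw_block):
    """Uses the block size reported at run time; partial blocks get plain stores."""
    end = start + length
    addr = start
    while addr < end:
        if addr % hw_block == 0 and addr + hw_block <= end:
            block_zero(memory, addr, hw_block)
            addr += hw_block
        else:
            memory[addr] = 0
            addr += 1


def bytes_clobbered(zero_fn, hw_block):
    memory = bytearray(b"\xff" * 512)
    zero_fn(memory, 64, 384, hw_block)           # caller wants only [64, 448) cleared
    inside_ok = memory[64:448] == bytes(384)
    outside = memory[:64].count(0) + memory[448:].count(0)
    return inside_ok, outside


print(bytes_clobbered(zero_hardcoded, 32))    # prints: (True, 0)
print(bytes_clobbered(zero_hardcoded, 128))   # prints: (True, 128)
print(bytes_clobbered(zero_queried, 128))     # prints: (True, 0)
```

Code built for 32-byte blocks wipes 128 bytes it was never asked to touch when the hardware block is 128 bytes, while the version that asks for the size stays inside its range.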
Paul Clayton <[email protected]> posted:
I suspect that HW TM will never take hold of the CPU industry.
(The issue I have with limited optimistic concurrency mechanisms
like AMD's Advanced Synchronization Facility and My 66000's
Exotic Synchronization Mechanism is not the initial limits but
that there seems to be little presentation of an interface that
can be extended.
For example:: what ??
Of course, just as early broad software
abstractions present the risk of choosing the wrong abstraction
from lack of experience, having too many exceptional cases, and
delaying release, an ISA can be designed with excessive
flexibility that is not exploited much later and has immediate
costs.)
That is the problem when you have only been working on it for 22 years, alone, without feedback.
It never ceased to amaze me that Solaris would not boot without a
real TLB in the simulator. Just referencing all the right memory
where the tables were stored (using the CR holding said pointer)
was not enough--you had to have a TLB with at least 5 FA entries.
Mitch considers TM to be a SW problem and My 66000 ISA supports SW
by allowing multiple lines to participate in a TM transaction,
without over constraining how SW gets its job done, and with enough
HW defined behavior that SW can make a robust system with it. Other
than that TM is a SW problem.