I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register banks of 16 registers instead of 32. However, it didn't work out; there wasn't enough opcode space for the 16-bit register-to-register
instructions. This surprised me; maybe I can figure something out.
John Savard
"Chris M. Thomasson" <[email protected]> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:
The 2-operand+displacement LD/STs have a lock bit in the instruction--that
John Savard <[email protected]d> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for doing stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no different than any other calculation, except that no mangling of the bits is going on.
Just break down all the fancy instructions into RISC-style pseudo-ops. But apparently, since you would know all about that, there must be a reason why it doesn't apply in these cases.
x86 has short/small MOV instructions; not so with RISCs.
Does your EMS use a so-called LOCK MOV? For some damn reason I remember something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
It is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
Oh, and it's ESM, not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory, I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless impl. If we're on another system and that LOCK XADD is some sort of LL/SC "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt an ATOMIC add to the L1 cache. If the line is writeable, the ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped
out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line is invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified}, which is typical.}
When [address] reaches the Memory-Controller it is scheduled in arrival
order, other caches system-wide will receive CI, and modified lines
will be pushed back to the DRAM-Controller. When the CI is "performed",
MC/DRC will perform the add #1 to [address] and the previous value is
returned as its result.
{{That is, the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as the result,
with all cache effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under contention. So you DON'T DO IT LIKE THAT.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
On 12/28/25 6:53 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 12/28/2025 2:04 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
On 12/21/2025 1:21 PM, MitchAlsup wrote:
"Chris M. Thomasson" <[email protected]> posted:
On 12/21/2025 10:12 AM, MitchAlsup wrote:The 2-operand+displacement LD/STs have a lock bit in the instruction-- >>>>>> that
John Savard <[email protected]d> posted:
On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:
For argument setup (calling side) one needs MOV
{R1..R5},{Rm,Rn,Rj,Rk,Rl}
For returning values (calling side) needs MOV {Rm,Rn,Rj},{R1..R3}
For loop iterations needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
I just can't see how to make these run reasonably fast within the >>>>>>>>>> constraints of the GBOoO Data Path.
Since you actually worked at AMD, presumably you know why I'm mistaken
here...
when I read this, I thought that there was a standard technique for >>>>>>>>> doing
stuff like that in a GBOoO machine.
There is::: it is called "load 'em up, pass 'em through". That is no >>>>>>>> different than any other calculation, except that no mangling of the >>>>>>>> bits is going on.
Just break down all the fancy
instructions into RISC-style pseudo-ops. But apparently, since you >>>>>>>>> would
know all about that, there must be a reason why it doesn't apply in >>>>>>>>> these
cases.
x86 has short/small MOV instructions, Not so with RISCs.
Does your EMS use a so called LOCK MOV? For some damn reason I remember
something like that. The LOCK "prefix" for say XADD, CMPXCHG8B, ect.. >>>>>>
is it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant. >>>>>>
Oh, and its ESM not EMS. Exotic Synchronization Method.
In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
{A.K.A. a prefix}.
Thanks for the clarification.
On x86/x64 LOCK XADD is a loopless wait free operation.
I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless >>>> impl. If we on another system and that LOCK XADD is some sort of LL/SC >>>> "style" loop, well, that causes damage to my loopless claim... ;^o
So, can your system get wait free semantics for RMW atomics?
A::
ATOMIC-to-Memory-size [address]
ADD Rd,--,#1
Will attempt a ATOMIC add to L1 cache. If line is writeable, ADD is
performed and line updated. Otherwise, the Add-to-memory #1 is shipped >>> out over the memory hierarchy. When the operation runs into a cache
containing [address] in the writeable-state the add is performed and
the previous value returned. If [address] is not writeable the cache
line in invalidated and the search continues outward. {This protocol
depends on writeable implying {exclusive or modified} which is typical.} >>>
When [address] reached Memory-Controller it is scheduled in arrival
order, other caches system wide will receive CI, and modified lines
will be pushed back to DRAM-Controller. When CI is "performed" MC/
DRC will perform add #1 to [address] and previous value is returned
as its result.
{{That is the ADD is performed where the data is found in the
memory hierarchy, and the previous value is returned as result;
with all cache-effects and coherence considered.}}
A HW guy would not call this wait free--since the CPU is waiting
until all the nuances get sorted out, but SW will consider this
wait free since SW does not see the waiting time unless it uses
a high precision timer to measure delay.
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under contention. So you DON'T DO IT LIKE THAT.
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
A programmer
might use predication to avoid a branch that is expected to be
poorly predicted or to have more consistent execution time. The
former could be inappropriate for the computer to obey if the
branch predictor became effective for that branch. If prediction
is accurate, predicate prediction could improve performance but
would break execution time consistency. Even reducing execution
time when the predicate is known early might go against the
programmer's intent by leaking information.
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
may also be possible that the block is expected to be moved to
another core such that pushing the data manipulation to a
"centralized" location would improve performance, as was the
programmer's intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
If the contention is from false sharing (having multiple atomic
data in a cache block seems to be considered bad programming
practice, so this should not be common unless cache block size
grows), hardware could theoretically provide special word caches
(or "lossy" block compression where part of the block is dropped)
for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
be preferred if false sharing is less of a problem).
I do not know what the best interface would be, but it seems that
some care should be taken to account for differing intent when a
programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
hints and directives can have unexpected performance changes
under different microarchitectures or different usage.
Clever microarchitecture can make some optimizations sub-optimal
as well as cause programmers to ask "why didn't you do it the way
I told you to do it?!"
(I think I may not be communicating well. I am kind of tired
right now and this topic is complicated.)
Paul Clayton <[email protected]> posted:
Good point. Humm. Well, I just don't want to see the disassembly of
atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)
If you do it LL/SC-style you HAVE to bring data to "this" particular
CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under contention. So you DON'T DO IT LIKE THAT.
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}
I wonder if there is an issue of communicating intention to the
computer. Using atomic-to-memory may be intended to communicate
that the operation is expected to be under contention or that
moderating the impact under high contention is more important
than having a fast "happy path".
There is a speed of light problem here. Communicating across a
computer <system> is a microsecond time problem, whereas executing instructions is a nanosecond time problem.
And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once. Now, while the latency
is higher than possible with LL-SC, it is WAY LOWER than the worst case
with LL-SC under serious contention.
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
Explain
A programmer
might use predication to avoid a branch that is expected to be
poorly predicted or to have more consistent execution time. The
My 66000 predication can avoid 2 branches--it operates under the
notion that if FETCH reaches the join point before the condition is
known, then predication is always faster than branching.
former could be inappropriate for the computer to obey if the
branch predictor became effective for that branch. If prediction
is accurate, predicate prediction could improve performance but
would break execution time consistency. Even reducing execution
time when the predicate is known early might go against the
programmer's intent by leaking information.
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
may also be possible that the block is expected to be moved to
another core such that pushing the data manipulation to a "centralized" location would improve performance as was the
programmer intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
I have been betting that (in the general case) software will
remain incapable of doing such predictions, for quite some time.
If the contention is from false sharing (having multiple atomic
data in a cache block seems to be considered bad programming
practice, so this should not be common unless cache block size
grows), hardware could theoretically provide special word caches
(or "lossy" block compression where part of the block is dropped)
for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
be preferred if false sharing is less of a problem).
I do not know what the best interface would be, but it seems that
some care should be taken to account for differing intent when a programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
hints and directives can have unexpected performance changes
under different microarchitectures or different usage.
Clever microarchitecture can make some optimizations sub-optimal
as well as cause programmers to ask "why didn't you do it the way
I told you to do it?!"
Instruction scheduling is in effect an optimization for one implementation that may not hold for other implementations. In-Order pipelines need instruction scheduling, OoO do not, and GBOoO generally perform better if/when instructions are not (or less) scheduled.
Loop unrolling is a case where, if your machine has vVM, the code
runs faster when left rolled than when unrolled and executed with
branch prediction.
(I think I may not be communicating well. I am kind of tired
right now and this topic is complicated.)
John Savard <[email protected]d> writes:
On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
Or in other words, if you can decode K-instructions per cycle, you'd
better be able to execute K-instructions per cycle--or you have a
serious blockage in your pipeline.
No.
If you flipped "decode" and "execute" in that sentence above, I would 100% agree. And maybe this _is_ just a typo.
But if you actually did mean that sentence exactly as written, I would
disagree. This is why: I regard executing instructions as 'doing the
actual work' and decoding instructions as... some unfortunate trivial
overhead that can't be avoided.
It does not matter what "the actual work" is and what isn't. What
matters is how expensive it is to make a particular part wider, and
how paying that cost benefits the IPC. At every step you add width to
the part with the best benefit/cost ratio.
And looking at recent cores, we see that, e.g., Skymont can decode
3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to
functional units (i.e., can execute 26 uops in one cycle); I don't
know how many instructions it can retire per cycle, but I expect that
it is more than 8 per cycle.
So the renamer is the bottleneck, and that's also the idea behind
top-down microarchitecture analysis (TMA) for determining how software interacts with the microarchitecture. That idea is coming out of
Intel, but if Intel is finding it hard to make wider renamers rather
than wider other parts, I expect that the rest of the industry also
finds that hard (especially for architectures where decoding is
cheaper), and (looking at ARM A64) where instructions with more
demands on the renamer exist.
MitchAlsup <[email protected]d> posted:
Paul Clayton <[email protected]> posted:
If the requesting core has the cache block in exclusive or
modified state, remote execution might be less efficient. Yet it
In My 66000 architecture, an atomic operation on memory will have
the property of being executed where the cache line in a modifiable
state happens to reside. Here you don't want to "bring data in"
nor do you want to "push data out" you want the memory hierarchy
to be perturbed as little as possible WHILE performing the ATOMIC
event.
The cost is that each cache will have an adder--with the sizes of
cache these days, said adder is a trifling in area and seldom used
so it is low power.
may also be possible that the block is expected to be moved to
another core such that this pushing the data manipulation to a
"centralized" location would improve performance as was the
programmer intent (rather than avoiding contention overhead). (I
suppose in My 66000, a programmer would use a push/"prefetch"
instruction to move the cache block to a more central location,
but even that might be sub-optimal if the hardware can predict
the next data consumer such that centrally located would be
slower than shifting the data closer to the consumer.)
I have been betting that (in the general case) software will
remain incapable of doing such predictions, for quite some time.
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <[email protected]d> posted:
Paul Clayton <[email protected]> posted:
[snip]
LL-op-SC could be recognized as an idiom and avoid bringing data
to the core.
Can recognize:
LDL Rd,[address]
ADD Rd,Rd,#whatever
STC Rd,[address]
Cannot recognize:
LDA R1,[address]
CALL LoadLocked
ADD R2,R2,#whatever
CALL StoreConditional
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
And this is exactly where Add-to-Memory gains over Interferable
ATOMIC events--you only pay the latency once, now while the latency
is higher than possible with LL-SC, it is WAY LOWER than worst case
with LL-SC under serious contention.
Yes. Even a single adder would have higher throughput than ping-ponging
a cache block. One might even support a three-or-more-input,
two-or-more-result adder (or perhaps exploit usually smaller addends)
to increase throughput, though I suspect there would practically never
be a case where a simple adder would have insufficient throughput.
This seems to be similar to branch hints and predication in that
urging the computer to handle the task in a specific way may not
be optimal for the goal of the user/programmer.
Explain
Branch hints can be intended to reduce branch predictor aliasing
(i.e., assume the static hint is used instead of a dynamic
predictor), to provide agree information, to prefer one path
even if it is less likely, to provide an initialization of the
(per-address component only?) branch predictor, or for some
other motive. The interface/architecture might not be specific
about how such information will be used, especially if it is a
hint (and programmers might disagree about what the best
interface would be). If the interface is not very specific, a microarchitecture might violate a programmer's
desire/expectation by ignoring the hint or using it in a
different way.
Similarly predication can be motivated to avoid fetch
redirection (initial ARM and My 66000), to facilitate constant
time execution, to avoid the performance cost of branch
mispredictions, or perhaps for some reason that does not come to
mind. Predicate prediction would foil constant time execution
and might reduce performance (or merely introduce weird
performance variation). Even the fetch optimization might be
undone if the hardware discovers that the condition is extremely
biased and folds out the rarely used instructions; which would
be good for performance if the bias continues, but if the bias
changes just frequently enough it could hurt performance.
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Paul Clayton <[email protected]> posted:
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <[email protected]d> posted:
Paul Clayton <[email protected]> posted:
[snip]
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
Write, in pure K&R C, the functionality for LoadLocked and
StoreConditional.
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Time constancy is provided by executing both the then clause and the
else clause and using CMOV to decide on flow.
On 2/9/26 2:28 PM, MitchAlsup wrote:
Paul Clayton <[email protected]> posted:
On 2/5/26 4:27 PM, MitchAlsup wrote:
MitchAlsup <[email protected]d> posted:
Paul Clayton <[email protected]> posted:
[snip]
When would one want to decouple LL and SC into function calls
away from the computation? Perhaps for in-place software
instrumentation?
Write, in pure K&R C, the functionality for LoadLocked and StoreConditional.
Then why would the compiler not inline such? It seems reasonable
(to me) to blame poor scaling performance in that case on the
compiler which did not inline a one instruction function (or on
the developer who intentionally disabled such optimization).
[snip]
[snip]
There is no reason not to predict My 66000-style predication,
nor is there any great desire/need TO predict them, either.
Except that prediction could violate the time constancy assumed
by the programmer.
Time constancy is provided by executing both the then clause and the else clause and using CMOV to decide on flow.
This means ordinary My 66000 predication cannot be used for
such. I have no idea whether writers of cryptographic software
would be upset that a mechanism which seems like it would
provide constant time is documented not to do so.
(I would
_guess_ that most of the embedded strong guarantee real time
software might select hardware that never does predict
predication since caches and similar general purpose
optimizations tend to increase the difficulty of providing
strong timing guarantees.)
Is My 66000 going to have CMOV?
What about something like some
of the AArch64 simple conditional operations? (Yes, such gets
un-RISCy both from count of defined instructions and instruction
complexity, but I have read that the supported operations are
somewhat common for single instruction hammock branches and are
easier to fit microarchitecturally and to fit within a 32-bit
instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
encoding (i.e., vagaries of idiom recognition — though an
encoded instruction can be implemented with performance that
breaks expectations) are disadvantages of just using PRED, but
PRED also removes artificial restrictions like only incrementing
by 1 (such that idiom recognition could be implemented to fast
path any conditional add_immediate).
One interesting case for predication is code like the following:
if (a < 25) {a += 8;}
The predicate is known to be available at least early enough to
nullify the addition executed in parallel with the compare
because the register operand is the same and the only other
operands are immediates. It is not clear if this presents a
useful optimization opportunity, but it seems an interesting
case.
I am also curious about what you view as the trade-offs of
conditional move compared with conditional select. Obviously
conditional select potentially includes a register copy and for
typical OoO implementations the mechanism is similar because a
conditional move renames the destination and reads the old value
and the alternative value. Encoding the condition source may be
harder for conditional select (e.g., test register A is/is not
zero, based on test place register B or register C in register
D, which requires four register names while conditional move
exploits that one of the sources is the same as the
destination); for My 66000 a generalized condition after a
comparison would, I think, require a 6-bit condition bit
specifier and a register source for the condition, which just
fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
specifier, four 5-bit register names).
Paul Clayton <[email protected]> posted:
<snip>
On 2/9/26 2:28 PM, MitchAlsup wrote:
Write, in pure K&R C, the functionality for LoadLocked and
StoreConditional.
Then why would the compiler not inline such? It seems reasonable
(to me) to blame poor scaling performance in that case on the
compiler which did not inline a one instruction function (or on
the developer who intentionally disabled such optimization).
Because in pure K&R C there is no concept of atomic things, and
thus one has to resort to ASM--beyond this, there is no concept
of inlining.
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
e.g. on the PDP-11).
According to Scott Lurndal <[email protected]>:
I do believe that it would be possible in K&R C.
Create a static array of bytes with the appropriate
object code to perform the operation followed
by a RET instruction. Cast the address
of the array to a function pointer and call it.
Or take the address of a branch label in an unreachable
point in a function (e.g. after a return), overwrite
the text at that label with the necessary object code
then branch to it (text was generally RW in the K&R days,
eg on the PDP-11).
Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
/44, user programs had separate instruction and data mappings. The
smaller ones, the 11/34, /40, and /60, didn't.
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
John Levine wrote:
We knew it was possible to do runtime code generation, since it's an
ancient technique widely used by sort programs to speed up the inner
comparison loop, but we had enough trouble getting our programs to
work the normal way.
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data
and accept that the full sort would only deliver a partially sorted
results, so that I had to do a final sort pass over each chunk of equal prefix.
This could well be faster than a complicated sort operator, even if it
was run-time generated, particularly since that runtime code would need
to be called.
Terje
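The 8-byte-prefix technique can be sketched as packing the leading key bytes into one big-endian integer, so that a single unsigned compare orders records the same way memcmp would (a hypothetical illustration, not code from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Pack the first 8 bytes of a key into a big-endian uint64_t, padding
 * short keys with zeros, so that comparing the integers gives the same
 * order as memcmp on the zero-padded 8-byte prefixes. */
static uint64_t key_prefix(const char *key, size_t len)
{
    unsigned char buf[8] = { 0 };
    memcpy(buf, key, len < 8 ? len : 8);
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}
```

Sorting on the packed prefix alone leaves runs of records with equal prefixes, which is exactly why a final sort pass over each such chunk is needed.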
According to Terje Mathisen <[email protected]>:
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was
trivially comparable.
The final option was to extract a short prefix (8-bytes) of the key data
and accept that the full sort would only deliver a partially sorted
results, so that I had to do a final sort pass over each chunk of equal
prefix.
I'm pretty sure you'll find all of those in sort programs from the 1960s.
Sorting used more computer time than anything else, so they put a great
deal of work into making it fast.
See, for example:
https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf
I'm pretty sure you'll find all of those in sort programs from the 1960s.
Sorting used more computer time than anything else, so they put a great
deal of work into making it fast.
Indeed, the sort subsystem on the Burroughs systems was, perhaps,
even more important than the MCP itself to a lot of customers.
Although large sorts in those days were done using magnetic tapes.
(watching a large sort-merge running on a dozen 9-track drives
was pretty impressive).
According to Scott Lurndal <[email protected]>:
Indeed, the sort subsystem on the Burroughs systems was, perhaps,
even more important that the MCP itself to a lot of customers.
Although large sorts in those days were done using magnetic tapes.
(watching a large sort-merge running on a dozen 9-track drives
was pretty impressive).
Yup. I suspect that the read backward feature on tape drives was
invented to speed up sorting, when someone realized that if
you'd written a set of tapes, you could read them backward
and merge into the next set of tapes in reverse order, repeat
until done.
My "Algorithms and Data Structures" professor told us that IBM had
patented a zero-time sort chip, which they then promptly buried since
their internal stats said that over all IBM mainframes, 60% of the CPU
hours were spent in various forms of sorting.
Terje
PS. The chip was a ladder of comparators, so that you could use a
channel to stream data into it, then when you were done (or the chip was
full) you would run the channel transfer in the opposite direction to
retrieve the sorted data.
In article <10ml9ef$264n$[email protected]>, [email protected] (John Levine) wrote:
I suspect that the read backward feature on tape drives was
invented to speed up sorting, when someone realized that if
you'd written a set of tapes, you could read them backward
and merge into the next set of tapes in reverse order, repeat
until done.
Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason.
<https://www.amazon.co.uk/dp/0262517205>
It appears that Terje Mathisen <[email protected]> said:
My "Algorithms and Data Structures" professor told us that IBM had
patented a zero-time sort chip, which they then promptly buried since
their internal stats said that over all IBM mainframes, 60% of the CPU
hours were spent in various forms of sorting.
Terje
PS. The chip was a ladder of comparators, so that you could use a
channel to stream data into it, then when you were done (or the chip was
full) you would run the channel transfer in the opposite direction to
retrieve the sorted data.
Sounds like an urban legend to me. For one thing, on machines of that
era no useful sort fit in core so it spent most of its time doing disk
and tape I/O anyway.
For another, if it really did make a difference, they would have
called it the Improved Sort Performance Facility and sold it as an
extra cost option.
These days z/Series has the optional Enhanced-Sort Facility, added in
September 2020, which adds a complex SORT LISTS instruction that does
in-memory sorts or merges, presumably in microcode so it can run at full
memory speed.
Terje Mathisen <[email protected]> posted:
40 years ago my own sort functions/programs would instead either extract
the compound keys into a form that could be compared directly, or
re-write (with a reversible transform) the data in place, so that it was
trivially comparable.
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
and machines (i.e., a virus).
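For readers who haven't seen it, Duff's device interleaves a switch with a do/while so that the switch jumps into the middle of an unrolled loop to handle the remainder first. A memcpy-style sketch (Duff's original copied to a fixed output port, so `to` was not incremented; like the original, this assumes count > 0):

```c
#include <stddef.h>

/* Duff's device: the switch enters the unrolled do/while body at the
 * point that handles the count % 8 leftover bytes, then the loop runs
 * full groups of 8 until n reaches zero.  Assumes count > 0. */
static void duff_copy(char *to, const char *from, size_t count)
{
    size_t n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

It is legal C precisely because case labels are ordinary labels and may sit anywhere inside the statement the switch controls.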
John Levine <[email protected]> posted:
Sounds like an urban legend to me. For one thing, on machines of that
era no useful sort fit in core so it spent most of its time doing disk
and tape I/O anyway.
I remember hearing about the chip (circa 1986-8).
I agree that the number of entries would not have been enough to justify.
I also agree that Interrupt response time after the I/O would have
prevented any useful performance advantage.
By the mid-1990s the entry count could have been large enough to justify.
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
Was it ever that?
Yes, it was a very compact way to express the same logic you would use
in asm to jump into the correct spot of an unrolled loop, but did it
actually save any time compared to a plain unroll followed by a final fixup?
Duff would never compile into anything strange, but as a way to
obfuscate it could work, I guess? Absolutely no need to flag it as
malware/virus!
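The plain-unroll-plus-final-fixup alternative Terje mentions looks like this (a hypothetical sketch, not code from the thread):

```c
#include <stddef.h>

/* Unroll by 8, then a short fixup loop for the 0..7 leftover bytes.
 * No jump into the loop body; handles count == 0 naturally. */
static void plain_copy(char *to, const char *from, size_t count)
{
    while (count >= 8) {
        to[0] = from[0]; to[1] = from[1]; to[2] = from[2]; to[3] = from[3];
        to[4] = from[4]; to[5] = from[5]; to[6] = from[6]; to[7] = from[7];
        to += 8; from += 8; count -= 8;
    }
    while (count--)        /* fixup for the remainder */
        *to++ = *from++;
}
```

A modern compiler will typically vectorize this shape, which is one reason Duff's device no longer buys anything.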
On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <[email protected]d> wrote:
40 years ago, Duff's device was considered a reasonable way to improve
performance. It is now considered a means to take over applications
and machines (i.e., a virus).
I don't know that it ever really improved performance, but it still is
useful. I use the switch() form for backing out of situations having
complicated setup where processing isn't possible unless all the setup
steps succeed.
I keep track of how many setup steps successfully complete, and then,
whatever the exit situation, I jump into a Duff switch that tears it
all down in reverse order of the setup.
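The teardown pattern George describes can be sketched as follows (hypothetical resource names; the point is that entering the switch at the count of completed steps and falling through releases everything in reverse order of acquisition):

```c
#include <stdlib.h>

/* Hypothetical context holding three acquired resources. */
struct ctx { void *a, *b, *c; };

/* Returns the number of setup steps that succeeded (0..3). */
static int setup(struct ctx *x)
{
    int done = 0;
    if ((x->a = malloc(16)) == NULL) return done;
    done++;
    if ((x->b = malloc(16)) == NULL) return done;
    done++;
    if ((x->c = malloc(16)) == NULL) return done;
    done++;
    return done;
}

/* Jump into the switch at the number of completed steps; fallthrough
 * then tears everything down in reverse order of the setup. */
static void teardown(struct ctx *x, int done)
{
    switch (done) {
    case 3: free(x->c); /* fall through */
    case 2: free(x->b); /* fall through */
    case 1: free(x->a); /* fall through */
    case 0: break;
    }
}
```

This is the same idea as the classic goto-label error-exit chain in C, expressed with a single switch instead of a ladder of labels.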
George Neuner <[email protected]> writes:
I don't know that it ever really improved performance, but it still is
useful. I use the switch() form for backing out of situations having
complicated setup where processing isn't possible unless all the setup
steps succeed.
I keep track of how many setup steps successfully complete, and then,
whatever the exit situation, I jump into a Duff switch that tears it
all down in reverse order of the setup.
This seems to be missing an important aspect of the Duff switch: the
possibility of putting case labels inside control structures. Duff's device
uses a while starting before the first label, but you can do other strange
things, like jumping into an if:
switch (i) {
case 1:
    if (j == 3) {
case 2:
        printf("i == 2 || (i == 1 && j == 3)\n");
    }
}
or is your teardown complex enough to use such things?
Yours,
John Levine <[email protected]> writes:
These days z/Series has the optional Enhanced-Sort Facility, added in
September 2020, which adds a complex SORT LISTS instruction that does
in-memory sorts or merges, presumably in microcode so it can run at full
memory speed.
If the microarchitecture is worth anything, microcode does not run any
faster than architectural code, when doing the same operations.
One possible reason for the additional instruction is that there is a
hardware feature that provides a speedup; in that case the hardware
feature, not the microcode, is the reason for the speedup. Is there
such a hardware feature for SORT LISTS, and if so, what is it?
According to Anton Ertl <[email protected]>:
If the microarchitecture is worth anything, microcode does not run any
faster than architectural code, when doing the same operations.
One possible reason for the additional instruction is that there is a
hardware feature that provides a speedup; in that case the hardware
feature, not the microcode, is the reason for the speedup. Is there
such a hardware feature for SORT LISTS, and if so, what is it?
It says here it's a combination of hardware and microcode, but the
references are all broken or paywalled. It also says that the instruction
can run for quite a while, so it might be that the microcode can run the
memory at full speed better than ordinary instructions can.
https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15
This isn't their first sort speedup. A while ago they added instructions
that do the inner loop of heapsort. Not sure whether there's hardware
for that.