• Re: Accepting the Sense of Some of Mitch Alsup's Advice

    From John Savard@[email protected] to comp.arch on Sat Jan 31 18:14:31 2026
    From Newsgroup: comp.arch

    I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register
    banks of 16 registers instead of 32. However, it didn't work out; there
    wasn't enough opcode space for the 16-bit register-to-register
    instructions. This surprised me; maybe I can figure something out.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Feb 4 20:36:52 2026
    From Newsgroup: comp.arch


    John Savard <[email protected]d> posted:

    I attempted to construct an architecture with ordinary variable-length instructions, based on the Concertina II architecture, but with register banks of 16 registers instead of 32. However, it didn't work out; there wasn't enough opcode space for the 16-bit register-to-register
    instructions. This surprised me; maybe I can figure something out.

    In my honest opinion, to get 16-bit instructions to work/help/add-value
    you have to sacrifice much elsewhere in your ISA.

    Whether you can get it to work is a semblance of your balance.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Wed Feb 4 21:32:58 2026
    From Newsgroup: comp.arch

    On 12/28/25 6:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for
    doing stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy instructions into RISC-style pseudo-ops.
    But apparently, since you would know all about that, there must be a
    reason why it doesn't apply in these cases.

    x86 has short/small MOV instructions. Not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember
    something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
    The 2-operand+displacement LD/STs have a lock bit in the instruction--
    that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
    Oh, and it's ESM, not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
    impl. If we're on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

    Will attempt an ATOMIC add to the L1 cache. If the line is writeable,
    the ADD is performed and the line updated. Otherwise, the Add-to-memory
    of #1 is shipped out over the memory hierarchy. When the operation runs
    into a cache containing [address] in the writeable state the add is
    performed and the previous value returned. If [address] is not
    writeable the cache line is invalidated and the search continues
    outward. {This protocol depends on writeable implying {exclusive or
    modified}, which is typical.}

    When [address] reaches the Memory-Controller it is scheduled in
    arrival order, other caches system-wide will receive a CI, and
    modified lines will be pushed back to the DRAM-Controller. When the
    CI is "performed" the MC/DRC will perform the add of #1 to [address]
    and the previous value is returned as its result.

    {{That is, the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.
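    The search-outward behavior described above can be sketched in a few
    lines. This is a toy model of the protocol as explained in the post,
    not My 66000 hardware; the level names and dict-based "caches" are
    purely illustrative.

```python
# Toy model of "perform the add where the data is found": cache levels
# are searched outward; the first holder of the line in a writeable
# (exclusive/modified) state performs the add and returns the previous
# value; non-writeable copies met on the way are invalidated.

class Level:
    def __init__(self, name, lines, writeable):
        self.name = name
        self.lines = lines        # {address: value} lines present here
        self.writeable = writeable

def atomic_add_to_memory(hierarchy, dram, address, addend):
    """Ship the add outward; return (previous value, where it was done)."""
    for level in hierarchy:
        if address in level.lines:
            if level.writeable:
                prev = level.lines[address]
                level.lines[address] += addend
                return prev, level.name
            # Not writeable: invalidate this copy, keep searching outward.
            del level.lines[address]
    # Reached the memory controller: perform the add at DRAM.
    prev = dram[address]
    dram[address] += addend
    return prev, "DRAM"
```

    Note that the requesting core never pulls the line in; it only
    receives the previous value as the result.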

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
    CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
    contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.
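    The two shapes being contrasted here can be sketched side by side.
    This illustrates control flow only, not real hardware; interference
    (another core touching the line, failing the SC) is simulated by a
    callback, which is an assumption of the sketch.

```python
# LL/SC shape: data comes to "this" core, and the store-conditional can
# fail under contention, forcing a retry loop.
def ll_sc_fetch_add(cell, addend, interfere=lambda: False):
    attempts = 0
    while True:
        attempts += 1
        old = cell[0]              # LL: value pulled to the core
        if interfere():            # simulated: line stolen, SC fails
            continue               # retry the whole load-op-store
        cell[0] = old + addend     # SC succeeds
        return old, attempts

# Atomic-to-memory shape: one request, no retry loop on the core.
def fetch_add(cell, addend):
    old = cell[0]
    cell[0] = old + addend
    return old
```

    Under contention the LL/SC version loops an unbounded number of
    times, which is exactly the traffic problem described above.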

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer. A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The
    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it
    may also be possible that the block is expected to be moved to
    another core such that pushing the data manipulation to a
    "centralized" location would improve performance, as was the
    programmer's intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the
    optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).
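    The false-sharing situation above is easy to see with a little block
    arithmetic: independently updated counters packed into one cache block
    contend as if they were one datum, while padding each counter to its
    own block removes the sharing. A 64-byte block size is an assumption
    of this sketch.

```python
BLOCK = 64  # assumed cache block size in bytes

def block_of(byte_offset):
    """Which cache block a given byte offset falls into."""
    return byte_offset // BLOCK

def layout(n_counters, stride):
    """Byte offsets of n counters placed 'stride' bytes apart."""
    return [i * stride for i in range(n_counters)]

# Four 8-byte counters packed tightly: all land in the same block.
packed = [block_of(o) for o in layout(4, 8)]
# The same counters padded to one per block: no false sharing.
padded = [block_of(o) for o in layout(4, 64)]
```

    The padded layout trades memory footprint for independence, which is
    the "optimization preference" trade-off mentioned above.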

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a
    programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Feb 5 18:49:39 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 12/28/25 6:53 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/28/2025 2:04 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:
    On 12/21/2025 1:21 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <[email protected]> posted:

    On 12/21/2025 10:12 AM, MitchAlsup wrote:

    John Savard <[email protected]d> posted:

    On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:

    For argument setup (calling side) one needs MOV
    {R1..R5},{Rm,Rn,Rj,Rk,Rl}
    For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}

    For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}

    I just can't see how to make these run reasonably fast within the
    constraints of the GBOoO Data Path.

    Since you actually worked at AMD, presumably you know why I'm mistaken
    here...

    when I read this, I thought that there was a standard technique for
    doing stuff like that in a GBOoO machine.

    There is::: it is called "load 'em up, pass 'em through". That is no
    different than any other calculation, except that no mangling of the
    bits is going on.

    Just break down all the fancy instructions into RISC-style pseudo-ops.
    But apparently, since you would know all about that, there must be a
    reason why it doesn't apply in these cases.

    x86 has short/small MOV instructions. Not so with RISCs.

    Does your EMS use a so-called LOCK MOV? For some damn reason I remember
    something like that. The LOCK "prefix" for, say, XADD, CMPXCHG8B, etc.
    The 2-operand+displacement LD/STs have a lock bit in the instruction--
    that is, it is not a Prefix. MOV in My 66000 is reg-reg or reg-constant.
    Oh, and it's ESM, not EMS. Exotic Synchronization Method.

    In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-Modifier
    {A.K.A. a prefix}.

    Thanks for the clarification.

    On x86/x64 LOCK XADD is a loopless wait free operation.

    I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
    impl. If we're on another system and that LOCK XADD is some sort of LL/SC
    "style" loop, well, that causes damage to my loopless claim... ;^o

    So, can your system get wait free semantics for RMW atomics?

    A::

    ATOMIC-to-Memory-size [address]
    ADD Rd,--,#1

    Will attempt an ATOMIC add to the L1 cache. If the line is writeable,
    the ADD is performed and the line updated. Otherwise, the Add-to-memory
    of #1 is shipped out over the memory hierarchy. When the operation runs
    into a cache containing [address] in the writeable state the add is
    performed and the previous value returned. If [address] is not
    writeable the cache line is invalidated and the search continues
    outward. {This protocol depends on writeable implying {exclusive or
    modified}, which is typical.}

    When [address] reaches the Memory-Controller it is scheduled in
    arrival order, other caches system-wide will receive a CI, and
    modified lines will be pushed back to the DRAM-Controller. When the
    CI is "performed" the MC/DRC will perform the add of #1 to [address]
    and the previous value is returned as its result.

    {{That is, the ADD is performed where the data is found in the
    memory hierarchy, and the previous value is returned as result;
    with all cache-effects and coherence considered.}}

    A HW guy would not call this wait free--since the CPU is waiting
    until all the nuances get sorted out, but SW will consider this
    wait free since SW does not see the waiting time unless it uses
    a high precision timer to measure delay.

    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
    CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
    contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once, now while the latency
    is higher than possible with LL-SC, it is WAY LOWER than worst case
    with LL-SC under serious contention.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.
    Explain
    A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The

    My 66000 predication can avoid 2 branches--it operates under the
    notion that if FETCH reaches the join point before condition is
    known, then predication is always faster than branching.
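    The trade being described can be seen in miniature: a branchy if/else
    redirects fetch on one of the two paths, while a predicated form
    fetches both short clauses straight-line and lets a possibly late
    condition select the result. This is an illustration of the idea, not
    My 66000 encoding.

```python
def branchy_abs(x):
    # Control flow: fetch is redirected on one of the two paths.
    if x < 0:
        return -x
    else:
        return x

def predicated_abs(x):
    # Predicated shape: both clauses execute straight-line; the
    # condition, even if it resolves late, only drives a select.
    neg = -x          # "then" clause
    pos = x           # "else" clause
    p = x < 0         # condition may arrive after both clauses
    return neg if p else pos   # single select, no fetch redirection
```

    When FETCH reaches the join point before the condition resolves, the
    predicated shape has nothing left to wait on but the select.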

    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it
    may also be possible that the block is expected to be moved to
    another core such that pushing the data manipulation to a
    "centralized" location would improve performance, as was the
    programmer's intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a
    programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    Instruction scheduling is in effect an optimization for one
    implementation that may not hold for other implementations. In-order
    pipelines need instruction scheduling, OoO do not; GBOoO generally
    perform better if/when instructions are not (or less) scheduled.

    Loop unrolling is a case where if your machine has vVM the code
    runs faster when not {unrolled and executed with branch prediction}
    than when {}.

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Feb 5 21:27:51 2026
    From Newsgroup: comp.arch


    MitchAlsup <[email protected]d> posted:


    Paul Clayton <[email protected]> posted:

    -----------------
    Good point. Humm. Well, I just don't want to see the disassembly of
    atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)

    If you do it LL/SC-style you HAVE to bring data to "this" particular
    CPU, and that (all by itself) causes n^2 to n^3 "bus" traffic under
    contention. So you DON'T DO IT LIKE THAT.

    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once, now while the latency
    is higher than possible with LL-SC, it is WAY LOWER than worst case
    with LL-SC under serious contention.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.
    Explain
    A programmer
    might use predication to avoid a branch that is expected to be
    poorly predicted or to have more consistent execution time. The

    My 66000 predication can avoid 2 branches--it operates under the
    notion that if FETCH reaches the join point before condition is
    known, then predication is always faster than branching.

    former could be inappropriate for the computer to obey if the
    branch predictor became effective for that branch. If prediction
    is accurate, predicate prediction could improve performance but
    would break execution time consistency. Even reducing execution
    time when the predicate is known early might go against the
    programmer's intent by leaking information.

    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it

    In My 66000 architecture, an atomic operation on memory will have
    the property of being executed where the cache line in a modifiable
    state happens to reside. Here you don't want to "bring data in"
    nor do you want to "push data out" you want the memory hierarchy
    to be perturbed as little as possible WHILE performing the ATOMIC
    event.

    The cost is that each cache will have an adder--with the sizes of
    cache these days, said adder is a trifling in area and seldom used
    so it is low power.

    may also be possible that the block is expected to be moved to
    another core such that pushing the data manipulation to a
    "centralized" location would improve performance, as was the
    programmer's intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    If the contention is from false sharing (having multiple atomic
    data in a cache block seems to be considered bad programming
    practice, so this should not be common unless cache block size
    grows), hardware could theoretically provide special word caches
    (or "lossy" block compression where part of the block is dropped)
    for moderating the impact of false sharing. This would change the optimization preferences for the program (more compact data might
    be preferred if false sharing is less of a problem).

    I do not know what the best interface would be, but it seems that
    some care should be taken to account for differing intent when a programmer suggests a specific mechanism. This also gets into the distinction/spectrum/space between a hint and a directive. Both
    hints and directives can have unexpected performance changes
    under different microarchitectures or different usage.

    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    Instruction scheduling is in effect an optimization for one
    implementation that may not hold for other implementations. In-order
    pipelines need instruction scheduling, OoO do not; GBOoO generally
    perform better if/when instructions are not (or less) scheduled.

    Loop unrolling is a case where if your machine has vVM the code
    runs faster when not {unrolled and executed with branch prediction}
    than when {}.

    (I think I may not be communicating well. I am kind of tired
    right now and this topic is complicated.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun Feb 8 16:42:56 2026
    From Newsgroup: comp.arch

    On 12/19/25 12:41 PM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    On Thu, 18 Dec 2025 21:29:00 +0000, MitchAlsup wrote:
    Or in other words, if you can decode K-instructions per cycle, you'd
    better be able to execute K-instructions per cycle--or you have a
    serious blockage in your pipeline.

    No.

    If you flipped "decode" and "execute" in that sentence above, I would
    100% agree. And maybe this _is_ just a typo.

    But if you actually did mean that sentence exactly as written, I would
    disagree. This is why: I regard executing instructions as 'doing the
    actual work' and decoding instructions as... some unfortunate trivial
    overhead that can't be avoided.

    It does not matter what "the actual work" is and what isn't. What
    matters is how expensive it is to make a particular part wider, and
    how paying that cost benefits the IPC. At every step you add width to
    the part with the best benefit/cost ratio.

    And looking at recent cores, we see that, e.g., Skymont can decode
    3x3=9 instructions per cycle, rename 8 per cycle, has 26 ports to
    functional units (i.e., can execute 26 uops in one cycle); I don't
    know how many instructions it can retire per cycle, but I expect that
    it is more than 8 per cycle.

    So the renamer is the bottleneck, and that's also the idea behind
    top-down microarchitecture analysis (TMA) for determining how software interacts with the microarchitecture. That idea is coming out of
    Intel, but if Intel is finding it hard to make wider renamers rather
    than wider other parts, I expect that the rest of the industry also
    finds that hard (especially for architectures where decoding is
    cheaper), and (looking at ARM A64) where instructions with more
    demands on the renamer exist.

    It is not clear to me that the renamer is clearly the bottleneck
    merely because it is narrower. Wider rename might not increase
    performance that much. The 9-instruction decode is a consequence
    of using three similar 3-wide decoders with predicting three
    targets (so taken branches can be handled). If the target
    predictor for straightline code is for each three instructions,
    then all the decoders should be the same width. Avoiding 4-wide
    decoders makes sense both to support more taken branches and to
    avoid the extra instruction length determination complexity.

    (Since rename is 8 µops and decode is 9 _instructions_ the width
    difference is greater than it might seem from the numbers. I
    think x86 designs still have more instruction cracking than
    instruction fusion.)

    A component can be small because it is expensive to make it
    larger or because there is less benefit to making it larger. I
    would consider the former a bottleneck for the designers, but
    the later seems to imply that other constraints should be
    loosened before more design effort is applied to that component.

    I have not been paying attention to achieved IPC, but I got the
    impression that for messy "integer" code achieved IPC was not up
    to 4. While the front end needs to process more instructions
    than are committed due to branch mispredictions, it seems at
    least plausible that wider rename is not that helpful generally.

    Of course, supporting bursts makes sense -- otherwise execute and
    retirement would not be so wide -- but a more order-constrained
    front end may not benefit as much from enabling bursts of high
    activity.

    With banking and replication a lot of RAT read and write ports
    could be supported and if rename-width internal forwarding
    avoids RAT access (instead transferring from the free list) the
    port count could be further reduced. Merging reused source names
    could also reduce RAT port count.
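    The internal-forwarding point above can be made concrete with a toy
    rename sketch: within one rename group, a source that was just
    written by an earlier instruction in the same group is forwarded
    internally and needs no RAT read port. The structures here are
    illustrative, not any real core's.

```python
def rename_group(rat, free_list, group):
    """Rename a group of (dest, [sources]) ops.

    Returns (renamed ops as (phys_dest, [phys_sources]), RAT reads used).
    """
    local = {}           # dests already renamed within this group
    rat_reads = 0
    renamed = []
    for dest, srcs in group:
        psrcs = []
        for s in srcs:
            if s in local:           # internal forwarding: no RAT port
                psrcs.append(local[s])
            else:
                rat_reads += 1       # must spend a RAT read port
                psrcs.append(rat[s])
        pdest = free_list.pop(0)     # allocate from the free list
        local[dest] = pdest
        rat[dest] = pdest            # RAT write still required
        renamed.append((pdest, psrcs))
    return renamed, rat_reads
```

    The more dependent the group, the fewer RAT read ports it needs,
    which is why merging reused source names helps port count.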

    I am not suggesting that increasing rename width is trivial; it
    seems somewhat similar to the dependency tracking for in-order
    superscalar execution, which is part of what motivated VLIW.

    (In theory, dependent operations could have higher rename
    latency because they cannot execute for at least one cycle after
    the source value providing operation. I do not recall reading
    any proposal to exploit such and it would increase the
    complexity of filling the scheduler. Perhaps such variable
    rename latency might be useful in a banked RAT where bank
    conflicts could delay renaming. Preferring non-dependent (and
    older) operations might reduce the impact of bank contention.)

    I *suspect* that there is some architectural and
    microarchitectural opportunity to reduce the overhead of
    renaming, checkpointing, and other out-of-order execution
    mechanisms. Caching dependency information in a decoded
    instruction cache might be helpful, for example.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun Feb 8 17:48:13 2026
    From Newsgroup: comp.arch

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <[email protected]d> posted:

    Paul Clayton <[email protected]> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
    Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once, now while the latency
    is higher than possible with LL-SC, it is WAY LOWER than worst case
    with LL-SC under serious contention.

    Yes. Even a single adder would have higher throughput than ping-
    ponging a cache block. One might even support a three-or-more-input,
    two-or-more-result adder (or perhaps exploit usually smaller addends)
    to increase throughput, though I suspect there would practically
    never be a case where a simple adder would have insufficient
    throughput.
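    The multi-input-adder idea amounts to request combining: several
    queued fetch-and-adds to the same address are applied in one pass,
    with each requester handed the value it would have seen had the adds
    been applied one at a time in queue order. A sketch of that
    bookkeeping (illustrative only):

```python
def combine_fetch_adds(initial, addends):
    """Combine queued fetch-and-adds to one address.

    Returns (previous value seen by each request, final memory value).
    """
    results = []
    running = initial
    for a in addends:
        results.append(running)  # this requester's "previous value"
        running += a             # fold its addend into the running sum
    return results, running
```

    One pass through the adder retires the whole queue, where LL/SC
    would have ping-ponged the line once per requester.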

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.

    Explain

    Branch hints can be intended to reduce branch predictor aliasing
    (i.e., assume the static hint is used instead of a dynamic
    predictor), to provide agree information, to prefer one path
    even if it is less likely, to provide an initialization of the
    (per-address component only?) branch predictor, or for some
    other motive. The interface/architecture might not be specific
    about how such information will be used, especially if it is a
    hint (and programmers might disagree about what the best
    interface would be). If the interface is not very specific, a
    microarchitecture might violate a programmer's
    desire/expectation by ignoring the hint or using it in a
    different way.

    Similarly predication can be motivated to avoid fetch
    redirection (initial ARM and My 66000), to facilitate constant
    time execution, to avoid the performance cost of branch
    mispredictions, or perhaps for some reason that does not come to
    mind. Predicate prediction would foil constant time execution
    and might reduce performance (or merely introduce weird
    performance variation). Even the fetch optimization might be
    undone if the hardware discovers that the condition is extremely
    biased and folds out the rarely used instructions; which would
    be good for performance if the bias continues, but if the bias
    changes just frequently enough it could hurt performance.

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    If the requesting core has the cache block in exclusive or
    modified state, remote execution might be less efficient. Yet it

    In My 66000 architecture, an atomic operation on memory will have
    the property of being executed where the cache line in a modifiable
    state happens to reside. Here you don't want to "bring data in"
    nor do you want to "push data out" you want the memory hierarchy
    to be perturbed as little as possible WHILE performing the ATOMIC
    event.

    Okay, that makes sense. The earlier statement implied (to me)
    that the operation was always "centralized".

    The cost is that each cache will have an adder--with the sizes of
    caches these days, said adder is a trifle in area, and it is seldom
    used, so it is low power.

    may also be possible that the block is expected to be moved to
    another core such that this pushing the data manipulation to a
    "centralized" location would improve performance as was the
    programmer intent (rather than avoiding contention overhead). (I
    suppose in My 66000, a programmer would use a push/"prefetch"
    instruction to move the cache block to a more central location,
    but even that might be sub-optimal if the hardware can predict
    the next data consumer such that centrally located would be
    slower than shifting the data closer to the consumer.)

    I have been betting that (in the general case) software will
    remain incapable of doing such predictions, for quite some time.

    A bet with very good odds in general, but I am sure there are
    still more than one "Mel" around who could optimize data
    movement.

    In cases like simple pipelines, the data communication pattern
    is obvious. For an embedded system, communicating stores might
    go directly to another core's local memory. With more general
    purpose compute, thread migration would be more common (even
    with more cores than runnable threads) and abstracting a
    communication interface might be more challenging.

    [snip]
    Clever microarchitecture can make some optimizations sub-optimal
    as well as cause programmers to ask "why didn't you do it the way
    I told you to do it?!"

    Instruction scheduling is in effect an optimization for one
    implementation that may not hold for other implementations. In-order
    pipelines need instruction scheduling; OoO ones do not; GBOoO
    generally performs better if/when instructions are not (or less)
    scheduled.

    Even OoO implementations can benefit from facilitating
    instruction fusion and perhaps even scheduler allocation to
    reduce communication overhead. A RAT-based renamer might also
    benefit from having duplicate register names in the same rename
    chunk. With in-order renaming, in theory a compiler could also
    optimize RAT bank conflicts. This might not be considered
    scheduling, but when the instruction stream is serial, placement
    is effectively scheduling.

    Loop unrolling is a case where if your machine has vVM the code
    runs faster when not {unrolled and executed with branch prediction}
    than when {}.

    Yeah.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Feb 9 19:28:25 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <[email protected]d> posted:

    Paul Clayton <[email protected]> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not
    Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}

    I wonder if there is an issue of communicating intention to the
    computer. Using atomic-to-memory may be intended to communicate
    that the operation is expected to be under contention or that
    moderating the impact under high contention is more important
    than having a fast "happy path".

    There is a speed of light problem here. Communicating across a
    computer <system> is a microsecond time problem, whereas executing
    instructions is a nanosecond time problem.

    And this is exactly where Add-to-Memory gains over Interferable
    ATOMIC events--you only pay the latency once; and while that latency
    is higher than possible with LL-SC, it is WAY LOWER than the worst
    case with LL-SC under serious contention.

    Yes. Even a single adder would have higher throughput than ping-
    ponging a cache block. One might even support a three-or-more-
    input, two-or-more-result adder (or perhaps exploit usually
    smaller addends) to increase throughput, though I suspect there
    would practically never be a case where a simple adder had
    insufficient throughput.

    This seems to be similar to branch hints and predication in that
    urging the computer to handle the task in a specific way may not
    be optimal for the goal of the user/programmer.

    Explain

    Branch hints can be intended to reduce branch predictor aliasing
    (i.e., assume the static hint is used instead of a dynamic
    predictor), to provide agree information, to prefer one path
    even if it is less likely, to provide an initialization of the
    (per-address component only?) branch predictor, or for some
    other motive. The interface/architecture might not be specific
    about how such information will be used, especially if it is a
    hint (and programmers might disagree about what the best
    interface would be). If the interface is not very specific, a
    microarchitecture might violate a programmer's
    desire/expectation by ignoring the hint or using it in a
    different way.

    Similarly predication can be motivated to avoid fetch
    redirection (initial ARM and My 66000), to facilitate constant
    time execution, to avoid the performance cost of branch
    mispredictions, or perhaps for some reason that does not come to
    mind. Predicate prediction would foil constant time execution
    and might reduce performance (or merely introduce weird
    performance variation). Even the fetch optimization might be
    undone if the hardware discovers that the condition is extremely
    biased and folds out the rarely used instructions; which would
    be good for performance if the bias continues, but if the bias
    changes just frequently enough it could hurt performance.

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then clause and
    the else clause and using CMOV to decide on flow.
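    A minimal sketch of that pattern in C (hypothetical helper; a real
    CMOV is emitted by the compiler, this just shows the compute-both-
    arms-then-select-by-mask shape that keeps timing data independent):

```c
#include <stdint.h>

/* Constant-time select: both arms are computed, and a data-dependent
 * mask picks one result without any branch. */
uint64_t ct_select(uint64_t cond, uint64_t then_v, uint64_t else_v) {
    uint64_t mask = (uint64_t)0 - (cond != 0);  /* all-ones or all-zeros */
    return (then_v & mask) | (else_v & ~mask);
}
```

    The execution time is the same regardless of `cond`, which is the
    property a branch (or a predicted predicate) would forfeit.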
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 9 20:36:54 2026
    From Newsgroup: comp.arch

    On 2/9/26 2:28 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <[email protected]d> posted:

    Paul Clayton <[email protected]> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    [snip]

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then clause and
    the else clause and using CMOV to decide on flow.

    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so. (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)


    Is My 66000 going to have CMOV? What about something like some
    of the AArch64 simple conditional operations? (Yes, such gets
    un-RISCy both from count of defined instructions and instruction
    complexity, but I have read that the supported operations are
    somewhat common for single instruction hammock branches and are
    easier to fit microarchitecturally and to fit within a 32-bit
    instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
    encoding (i.e., vagaries of idiom recognition — though an
    encoded instruction can be implemented with performance that
    breaks expectations) are disadvantages of just using PRED, but
    PRED also removes artificial restrictions like only incrementing
    by 1 (such that idiom recognition could be implemented to fast
    path any conditional add_immediate).

    One interesting case for predication is code like the following:
    if (a < 25) {a += 8;}

    The predicate is known to be available at least early enough to
    nullify the addition executed in parallel with the compare
    because the register operand is the same and the only other
    operands are immediates. It is not clear if this presents a
    useful optimization opportunity, but it seems an interesting
    case.
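    Assuming the compiler lowers the hammock branchlessly, it can be
    written as (hypothetical helper name, just to make the data flow
    explicit):

```c
#include <stdint.h>

/* Branchless form of: if (a < 25) a += 8;
 * The compare yields 0 or 1; scaling it by 8 selects the addend with
 * no control dependence, so the "addition" always executes. */
int64_t cond_add(int64_t a) {
    return a + 8 * (int64_t)(a < 25);
}
```

    Because the only register operand feeds both the compare and the add,
    the nullifying condition is available as early as the add itself.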

    I am also curious about what you view as the trade-offs of
    conditional move compared with conditional select. Obviously
    conditional select potentially includes a register copy and for
    typical OoO implementations the mechanism is similar because a
    conditional move renames the destination and reads the old value
    and the alternative value. Encoding the condition source may be
    harder for conditional select (e.g., test register A is/is not
    zero, based on test place register B or register C in register
    D, which requires four register names while conditional move
    exploits that one of the sources is the same as the
    destination); for My 66000 a generalized condition after a
    comparison would, I think, require a 6-bit condition bit
    specifier and a register source for the condition, which just
    fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
    specifier, four 5-bit register names).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Tue Feb 10 15:55:05 2026
    From Newsgroup: comp.arch

    Paul Clayton <[email protected]> writes:
    On 2/9/26 2:28 PM, MitchAlsup wrote:


    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so. (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)

    ARMv8 includes a "DIT" feature. Data Independent Timing. If
    that feature is present and enabled, software can be
    assured that a DIT capable instruction that processes
    data completes in the same time, regardless of the
    data being processed. Specifically applies to instructions that
    manipulate arithmetic data (loads, stores, integer, floating point,
    conditional compares, crypto instructions, etc).

    https://developer.arm.com/documentation/ddi0601/2025-12/AArch64-Registers/DIT--Data-Independent-Timing?lang=en

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Feb 10 19:02:09 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 2/9/26 2:28 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 2/5/26 4:27 PM, MitchAlsup wrote:

    MitchAlsup <[email protected]d> posted:

    Paul Clayton <[email protected]> posted:

    [snip]
    LL-op-SC could be recognized as an idiom and avoid bringing data
    to the core.

    Can recognize:

    LDL Rd,[address]
    ADD Rd,Rd,#whatever
    STC Rd,[address]

    Cannot recognize:

    LDA R1,[address]
    CALL LoadLocked
    ADD R2,R2,#whatever
    CALL StoreConditional

    When would one want to decouple LL and SC into function calls
    away from the computation? Perhaps for in-place software
    instrumentation?

    Write, in pure K&R C, the functionality for LoadLocked and StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    Because in pure K&R C there is no concept of atomic things, and
    thus one has to resort to ASM--beyond this, there is no concept
    of inlining.

    [snip]

    [snip]
    There is no reason not to predict My 66000-style predication,
    nor is there any great desire/need TO predict them, either.

    Except that prediction could violate the time constancy assumed
    by the programmer.

    Time constancy is provided by executing both the then clause and
    the else clause and using CMOV to decide on flow.

    This means ordinary My 66000 predication cannot be used for
    such. I have no idea whether writers of cryptographic software
    would be upset that a mechanism which seems like it would
    provide constant time is documented not to do so.

    I neither stated such nor implied such.

    (I would
    _guess_ that most of the embedded strong guarantee real time
    software might select hardware that never does predict
    predication since caches and similar general purpose
    optimizations tend to increase the difficulty of providing
    strong timing guarantees.)


    Is My 66000 going to have CMOV?

    Has had it for years; it has also had multi-instruction predication
    since forever.

    What about something like some
    of the AArch64 simple conditional operations? (Yes, such gets
    un-RISCy both from count of defined instructions and instruction
    complexity, but I have read that the supported operations are
    somewhat common for single instruction hammock branches and are
    easier to fit microarchitecturally and to fit within a 32-bit
    instruction.) The code bloat (8 bytes vs. 4 bytes) and indirect
    encoding (i.e., vagaries of idiom recognition — though an
    encoded instruction can be implemented with performance that
    breaks expectations) are disadvantages of just using PRED, but
    PRED also removes artificial restrictions like only incrementing
    by 1 (such that idiom recognition could be implemented to fast
    path any conditional add_immediate).

    One interesting case for predication is code like the following:
    if (a < 25) {a += 8;}

    The predicate is known to be available at least early enough to
    nullify the addition executed in parallel with the compare
    because the register operand is the same and the only other
    operands are immediates. It is not clear if this presents a
    useful optimization opportunity, but it seems an interesting
    case.

    I am also curious about what you view as the trade-offs of
    conditional move compared with conditional select. Obviously
    conditional select potentially includes a register copy and for
    typical OoO implementations the mechanism is similar because a
    conditional move renames the destination and reads the old value
    and the alternative value. Encoding the condition source may be
    harder for conditional select (e.g., test register A is/is not
    zero, based on test place register B or register C in register
    D, which requires four register names while conditional move
    exploits that one of the sources is the same as the
    destination); for My 66000 a generalized condition after a
    comparison would, I think, require a 6-bit condition bit
    specifier and a register source for the condition, which just
    fits in a 32-bit instruction (6-bit opcode, 6-bit condition bit
    specifier, four 5-bit register names).

    Clang goes out of its way to convert things like the above into
    CMOV form. Often this pre-optimization takes more instructions
    and runs slower than pure predication. Almost always, when the
    then clause or else clause is not already in a register, the
    CMOV version is longer and slower.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Tue Feb 10 20:23:12 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    Paul Clayton <[email protected]> posted:

    On 2/9/26 2:28 PM, MitchAlsup wrote:
    <snip>


    Write, in pure K&R C, the functionality for LoadLocked and
    StoreConditional.

    Then why would the compiler not inline such? It seems reasonable
    (to me) to blame poor scaling performance in that case on the
    compiler which did not inline a one instruction function (or on
    the developer who intentionally disabled such optimization).

    Because in pure K&R C there is no concept of atomic things, and
    thus one has to resort to ASM--beyond this, there is no concept
    of inlining.

    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Tue Feb 10 20:39:11 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <[email protected]>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60, didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Tue Feb 10 21:15:08 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    According to Scott Lurndal <[email protected]>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60, didn't.

    The 11/34 was what I had been using in the late 70's (Unix V6)
    and a bit of RSTS/E. That was after several years of TSS8.24
    where self-modifying code was de rigueur.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Feb 11 09:19:14 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Scott Lurndal <[email protected]>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60, didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was trivially comparable.

    The final option was to extract a short prefix (8 bytes) of the key
    data and accept that the full sort would only deliver a partially
    sorted result, so that I had to do a final sort pass over each chunk
    of equal prefix.

    This could well be faster than a complicated sort operator, even if it
    was run-time generated, particularly since that runtime code would need
    to be called.
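    The prefix extraction described above can be sketched as follows
    (hypothetical helper; packing the first 8 key bytes big-endian so
    that unsigned integer comparison agrees with memcmp() order):

```c
#include <stdint.h>
#include <string.h>

/* Pack the first 8 bytes of a key into a big-endian integer so that
 * comparing prefixes as unsigned integers matches byte-wise key order.
 * Keys sharing a full prefix still need a final tie-breaking pass. */
uint64_t key_prefix(const char *key, size_t len) {
    unsigned char buf[8] = {0};           /* short keys pad with zeros */
    memcpy(buf, key, len < 8 ? len : 8);
    uint64_t p = 0;
    for (int i = 0; i < 8; i++)
        p = (p << 8) | buf[i];
    return p;
}
```

    Sorting on these fixed-width prefixes turns the inner comparison into
    a single integer compare, which is the point of the transformation.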

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Feb 11 22:11:09 2026
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:

    John Levine wrote:
    According to Scott Lurndal <[email protected]>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60, didn't.

    We knew it was possible to do runtime code generation, since it's an ancient technique widely used by sort programs to speed up the inner comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was trivially comparable.

    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).
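    For reference, Duff's device in a pointer-advancing form (a sketch;
    the 1983 original copied to a fixed output register, and the helper
    assumes count > 0):

```c
/* Duff's device: unrolls a copy loop by interleaving a switch with a
 * do-while, so the remainder iterations fall into the unrolled body. */
void duff_copy(int *dst, const int *src, int count) {
    int n = (count + 7) / 8;              /* number of loop trips */
    switch (count % 8) {                  /* jump into the body */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```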

    The final option was to extract a short prefix (8 bytes) of the key
    data and accept that the full sort would only deliver a partially
    sorted result, so that I had to do a final sort pass over each chunk
    of equal prefix.

    This could well be faster than a complicated sort operator, even if it
    was run-time generated, particularly since that runtime code would need
    to be called.

    Terje

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Wed Feb 11 22:12:06 2026
    From Newsgroup: comp.arch

    According to Terje Mathisen <[email protected]>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was >trivially comparable.

    The final option was to extract a short prefix (8 bytes) of the key
    data and accept that the full sort would only deliver a partially
    sorted result, so that I had to do a final sort pass over each chunk
    of equal prefix.

    I'm pretty sure you'll find all of those in sort programs from the
    1960s. Sorting used more computer time than anything else so they
    put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Thu Feb 12 14:23:50 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Terje Mathisen <[email protected]>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8-bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    result, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the
    1960s. Sorting used more computer time than anything else so they
    put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf

    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu Feb 12 15:47:01 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    According to Terje Mathisen <[email protected]>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either
    extract the compound keys into a form that could be compared
    directly, or re-write (with a reversible transform) the data in
    place, so that it was trivially comparable.

    The final option was to extract a short prefix (8 bytes) of the key
    data and accept that the full sort would only deliver a partially
    sorted result, so that I had to do a final sort pass over each chunk
    of equal prefix.

    I'm pretty sure you'll find all of those in sort programs from the
    1960s. Sorting used more computer time than anything else so they
    put a great
    deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important than the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Thu Feb 12 19:25:03 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <[email protected]>:
    I'm pretty sure you'll find all of those in sort programs from the 1960s. >>Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important than the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).

    Yup. I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if you'd
    written a set of tapes, you could read them backward and merge into
    the next set of tapes in reverse order, repeat until done. You might
    need an extra pass at the end if the final result was an
    even-numbered pass written in reverse order, but you still won
    overall because you didn't have to rewind between all the
    intermediate passes.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu Feb 12 20:01:38 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    According to Scott Lurndal <[email protected]>:
    I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great deal of work into making it fast.

    Indeed, the sort subsystem on the Burroughs systems was, perhaps,
    even more important that the MCP itself to a lot of customers.

    Although large sorts in those days were done using magnetic tapes.

    (watching a large sort-merge running on a dozen 9-track drives
    was pretty impressive).

    Yup. I suspect that the read backward feature on tape drives was invented to speed up sorting,

    Indeed. The Burroughs SORT guru[*] was quite proud of his
    speed enhancements when read-reverse became available.

    His SORT intrinsics (SORT.(60s) and SORT:(70s-90s)) were
    extremely efficient (hand written assembler, for the most part)
    and used by all languages that had sorting capabilities (COBOL68/74/85,
    BPL, SPRITE).

    [*] RIP Bernie Buller, 2023/09/11.

    (He was also a fan of folk music and Oldsmobiles, and a tribute is up on youtube).
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Thu Feb 12 20:24:00 2026
    From Newsgroup: comp.arch

    In article <10ml9ef$264n$[email protected]>, [email protected] (John Levine)
    wrote:

    I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if
    you'd written a set of tapes, you could read them backward
    and merge into the next set of tapes in reverse order, repeat
    until done.

    Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason. <https://www.amazon.co.uk/dp/0262517205>

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Thu Feb 12 23:07:16 2026
    From Newsgroup: comp.arch

    It appears that Terje Mathisen <[email protected]> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was full) you would run the channel transfer in the opposite direction to retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges presumably in microcode so it can run at full
    memory speed.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Thu Feb 12 23:13:23 2026
    From Newsgroup: comp.arch

    According to John Dallman <[email protected]>:
    In article <10ml9ef$264n$[email protected]>, [email protected] (John Levine) wrote:

    I suspect that the read backward feature on tape drives was
    invented to speed up sorting, when someone realized that if
    you'd written a set of tapes, you could read them backward
    and merge into the next set of tapes in reverse order, repeat
    until done.

    Emerson W Pugh's _IBM's 360 & Early 370 Systems_ says that was the reason. <https://www.amazon.co.uk/dp/0262517205>

    So it does, on page 229. I have a paper copy of that book that I read long ago, so I guess I can't claim the idea was original.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 13 00:04:24 2026
    From Newsgroup: comp.arch


    John Levine <[email protected]> posted:

    It appears that Terje Mathisen <[email protected]> said:
    My "Algorithms and Data Structures" professor told us that IBM had >patented a zero-time sort chip, which they then promptly buried since >their internal stats said that over all IBM mainframes, 60% of the CPU >hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was full) you would run the channel transfer in the opposite direction to retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    I remember hearing about the chip (circa 1986-8).
    I agree that the number of entries would not have been enough to justify.
    I also agree that Interrupt response time after the I/O would have
    prevented any useful performance advantage.

    By the mid-1990s the entry count could have been large enough to justify.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in September 2020 which adds a complex SORT LISTS instruction that does in-memory sorts or merges presumably in microcode so it can run at full
    memory speed.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Thu Feb 12 23:34:10 2026
    From Newsgroup: comp.arch

    On 2/11/2026 2:12 PM, John Levine wrote:
    According to Terje Mathisen <[email protected]>:
    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    The final option was to extract a short prefix (8 bytes) of the key data
    and accept that the full sort would only deliver a partially sorted
    result, so that I had to do a final sort pass over each chunk of equal
    prefix.

    I'm pretty sure you'll find all of those in sort programs from the 1960s. Sorting used more computer time than anything else so they put a great
    deal of work into making it fast.

    See, for example:

    https://bitsavers.org/pdf/ibm/360/os/R01-08/C28-6543-3_Sort_Merge_Feb67.pdf

    Yup. And note that having a faster sort program than the standard IBM
    one made the company Syncsort a lot of money.

    https://en.wikipedia.org/wiki/Precisely_(company)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Fri Feb 13 11:54:15 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:

    John Levine wrote:
    According to Scott Lurndal <[email protected]>:
    I do believe that it would be possible in K&R C.

    Create a static array of bytes with the appropriate
    object code to perform the operation followed
    by a RET instruction. Cast the address
    of the array to a function pointer and call it.

    Or take the address of a branch label in an unreachable
    point in a function (e.g. after a return), overwrite
    the text at that label with the necessary object code
    then branch to it (text was generally RW in the K&R days,
    eg on the PDP-11).

    Sometimes. On most of the PDP-11s that ran Unix, the 11/45, /70, and
    /44, user programs had separate instruction and data mappings. The
    smaller ones, the 11/34, /40, and /60 didn't.

    We knew it was possible to do runtime code generation, since it's an
    ancient technique widely used by sort programs to speed up the inner
    comparison loop, but we had enough trouble getting our programs to
    work the normal way.

    40 years ago my own sort functions/programs would instead either extract
    the compound keys into a form that could be compared directly, or
    re-write (with a reversible transform) the data in place, so that it was
    trivially comparable.

    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications

    Was it ever that?

    Yes, it was a very compact way to express the same logic you would use
    in asm to jump into the correct spot of an unrolled loop, but did it
    actually save any time compared to a plain unroll followed by a final fixup?

    and machines (i.e., a virus).

    Duff would never compile into anything strange, but as a way to
    obfuscate it could work I guess? Absolutely no need to flag it as malware/virus!

    My own code which would jump into the immediate constant part of another instruction, treating data as code, could reasonably trigger some warnings.

    OTOH, "The Story of Mel" will forwever be the shining star for us
    oldtimers, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Fri Feb 13 11:36:05 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    John Levine <[email protected]> posted:

    It appears that Terje Mathisen <[email protected]> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.
    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    I remember hearing about the chip (circa 1986-8).
    I agree that the number of entries would not have been enough to justify.
    I also agree that Interrupt response time after the I/O would have
    prevented any useful performance advantage.

    By the mid-1990s the entry count could have been large enough to justify.

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in
    September 2020 which adds a complex SORT LISTS instruction that does
    in-memory sorts or merges presumably in microcode so it can run at full
    memory speed.


    A “Zero-Time” VLSI Sorter, 1983, IBM Journal of Research and Development
    https://scholar.archive.org/work/jnapdo24jbfaxhnxbiybonbqxu/access/wayback/http://pdfs.semanticscholar.org/e72e/903f94fb7c121efca13a717a1e713b3321f1.pdf

    "A hardware sorter suitable for VLSI implementation is proposed.
    It operates in a parallel and pipelined fashion, with the actual
    sorting time absorbed by the input/output time. A detailed VLSI
    implementation is described which has a very favorable device
    count compared to existing static RAM."



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Fri Feb 13 18:53:59 2026
    From Newsgroup: comp.arch

    According to Terje Mathisen <[email protected]>:
    40 years ago, Duff's device was considered a reasonable way to improve
    performance. It is now considered a means to take over applications

    Was it ever that?

    Yes, it was a very compact way to express the same logic you would use
    in asm to jump into the correct spot of an unrolled loop, but did it actually save any time compared to a plain unroll followed by a final fixup?

    Time, no, but on a 64K machine space was precious and it got much of the benefit of unrolling without as much space.

    Duff would never compile into anything strange, but as a way to
    obfuscate it could work I guess? Absolutely no need to flag it as malware/virus!

    Agreed, it was ugly but quite legit.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sat Feb 14 10:29:35 2026
    From Newsgroup: comp.arch

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup
    <[email protected]d> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is
    useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then,
    whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jm@[email protected] to comp.arch on Sat Feb 14 21:04:05 2026
    From Newsgroup: comp.arch

    George Neuner <[email protected]> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <[email protected]d> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then, whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to miss an important aspect of the Duff switch: the
    possibility of putting case labels inside a control structure. Duff's
    device uses a do-while starting before the first label, but you can do
    other strange things, like jumping into an if:

    switch(i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }

    or is your teardown complex enough to use such things?

    Yours,
    --
    Jean-Marc
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Feb 14 20:17:05 2026
    From Newsgroup: comp.arch


    jm <[email protected]> posted:

    George Neuner <[email protected]> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup <[email protected]d> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is useful. I use the switch() form for backing out of situations having complicated setup where processing isn't possible unless all the setup steps succeed.

    I keep track of how many setup steps successfully complete, and then, whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to miss an important aspect of the Duff switch: the
    possibility of putting case labels inside a control structure. Duff's
    device uses a do-while starting before the first label, but you can do
    other strange things, like jumping into an if:

    switch(i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }
    ----------------------------------------------------------------
    if( i == 2 || (i == 1 && j == 3) )
    printf("i == 2 || (i == 1 && j == 3)\n");

    is going to be fewer instructions.

    or is your teardown complex enough to use such things?

    Yours,

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sat Feb 14 17:58:00 2026
    From Newsgroup: comp.arch

    On Sat, 14 Feb 2026 21:04:05 +0100, jm <[email protected]> wrote:

    George Neuner <[email protected]> writes:

    On Wed, 11 Feb 2026 22:11:09 GMT, MitchAlsup
    <[email protected]d> wrote:


    40 years ago, Duff's device was considered a reasonable way to improve performance. It is now considered a means to take over applications
    and machines (i.e., a virus).

    I don't know that it ever really improved performance, but it still is
    useful. I use the switch() form for backing out of situations having
    complicated setup where processing isn't possible unless all the setup
    steps succeed.

    I keep track of how many setup steps successfully complete, and then,
    whatever the exit situation, I jump into a Duff switch that tears it
    all down in reverse order of the setup.

    This seems to miss an important aspect of the Duff switch: the
    possibility of putting case labels inside a control structure. Duff's
    device uses a do-while starting before the first label, but you can do
    other strange things, like jumping into an if:

    switch(i) {
    case 1:
        if (j == 3) {
    case 2:
            printf("i == 2 || (i == 1 && j == 3)\n");
        }
    }

    or is your teardown complex enough to use such things?

    Yours,

    I've never had to do anything like your example. I realize that it
    would work, but in fact I've actually never seen anything like that
    done in real life.

    For me, the utility of Duff's switch is that it provides multiple
    entry points into ... what arguably could be considered a single block
    of related code.

    Of course, C permits any case in a switch to fall through. What makes
    a particular case run a "Duff" (to me) is that the run includes at
    least 2 cases which lead to separate[*] code, and every case in the
    run falls through except the last one which exits the switch.

    [*] separate. not necessarily different or unique.

    I try not to put multiple "Duffs" in the same switch. I have had
    occasion to do it, but only very rarely.


    Certainly Duffs can do more than I use them for. I use them only for
    what I need.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Feb 15 09:36:18 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in September 2020 which adds a complex SORT LISTS instruction that does in-memory sorts or merges presumably in microcode so it can run at full
    memory speed.

    If the microarchitecture is worth anything, microcode does not run any
    faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a
    hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    For such hardware features the question is what should be exposed at
    the architectural level. One can use it internally in a high-level
    instruction like SORT LISTS, or expose it more directly in a
    lower-level instruction (e.g., Intel has AESENC, which just performs
    one round of AES encryption).

    The advantage of the lower-level instruction X is that it may have
    other uses, and no microcode for SORT LISTS is necessary (software is
    more malleable); the disadvantage is that the specific functionality
    of the hardware feature is now architectural, i.e., cast in stone, and
    if a different hardware feature in the future would be better or
    cheaper for the SORT LISTS functionality, one still has to provide the
    X functionality in some way, and then decide whether to also implement
    the other functionality, or just live with X.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sun Feb 15 14:34:11 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    It appears that Terje Mathisen <[email protected]> said:
    My "Algorithms and Data Structures" professor told us that IBM had
    patented a zero-time sort chip, which they then promptly buried since
    their internal stats said that over all IBM mainframes, 60% of the CPU
    hours were spent in various forms of sorting.

    Terje

    PS. The chip was a ladder of comparators, so that you could use a
    channel to stream data into it, then when you were done (or the chip was
    full) you would run the channel transfer in the opposite direction to
    retrieve the sorted data.

    Sounds like an urban legend to me. For one thing, on machines of that
    era no useful sort fit in core so it spent most of its time doing disk
    and tape I/O anyway.

    Professor Halaas (he had written the textbook we used and later on got
    FAST to actually produce a search chip) quoted an actual patent number
    afair.

    And that was probably the main (but not only!) reason for not actually creating the hardware. Besides, patents were already a way to gain bonuses/salary increases, right?

    For another, if it really did make a difference, they would have
    called it the Improved Sort Performance Facility and sold it as an
    extra cost option.

    These days z/Series has the optional Enhanced-Sort Facility added in September 2020 which adds a complex SORT LISTS instruction that does in-memory sorts or merges presuably in microcode so it can run at full
    memory speed.


    As soon as a sort is large enough that it needs to start in physical RAM
    and also end up there, the best you can do is to minimize the number of
    cache and TLB misses: The actual branch misses probably don't matter?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Mon Feb 16 20:55:06 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <[email protected]>:
    John Levine <[email protected]> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in September 2020 which adds a complex SORT LISTS instruction that does in-memory sorts or merges presumably in microcode so it can run at full memory speed.

    If the microarchitecture is worth anything, microcode does not run any
    faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a hardware feature that provides a speedup; in that case the hardware
    feature, not the microcode is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    It says here it's a combination of hardware and microcode, but the references are all broken or paywalled. It also says that the instruction can run for quite a while, so it might be that the microcode can run the memory at full speed better than ordinary instructions can.

    https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15

    This isn't their first sort speedup. A while ago they added instructions that do
    the inner loop of heapsort. Not sure whether there's hardware for that.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Feb 16 22:16:11 2026
    From Newsgroup: comp.arch


    John Levine <[email protected]> posted:

    According to Anton Ertl <[email protected]>:
    John Levine <[email protected]> writes:
    These days z/Series has the optional Enhanced-Sort Facility added in September 2020 which adds a complex SORT LISTS instruction that does in-memory sorts or merges presumably in microcode so it can run at full memory speed.

    If the microarchitecture is worth anything, microcode does not run any faster than architectural code, when doing the same operations.

    One possible reason for the additional instruction is that there is a hardware feature that provides a speedup; in that case the hardware feature, not the microcode, is the reason for the speedup. Is there

    It says here it's a combination of hardware and microcode, but the references are all broken or paywalled. It also says that the instruction can run for quite a while, so it might be that the microcode can run the memory at full speed better than ordinary instructions can.

    Sounds like a co-processor or an attached-processor using µcode for communication {setup and tear down}. Basically, core-CPU ships request
    to Sorter(tm); sorter does its job, and gets back at lower latency than
    an interrupt. Core-caches will contain post-sorted data.

    https://blog.share.org/Technology-Article/peeking-under-the-hood-of-sort-acceleration-on-z15

    This isn't their first sort speedup. A while ago they added instructions that do
    the inner loop of heapsort. Not sure whether there's hardware for that.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Tue Feb 17 01:19:58 2026
    From Newsgroup: comp.arch

    According to John Levine <[email protected]>:
    One possible reason for the additional instruction is that there is a hardware feature that provides a speedup; in that case the hardware feature, not the microcode, is the reason for the speedup. Is there
    such a hardware feature for SORT LISTS, and if so, what is it?

    I found a 2020 IBM JR&D article that describes it in some detail.
    It's some fairly simple hardware that keeps pointers to the keys
    in fast SRAM so it can do what IBM calls a looser tree merge sort
    and everyone else seems to call a loser tree merge sort.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2