• Re: Matmul in VVM

    From scott@[email protected] (Scott Lurndal) to comp.arch on Fri May 15 22:30:26 2026
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size. A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat May 16 10:22:34 2026
    From Newsgroup: comp.arch

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.
    I am not sure how this would compare with just loading the
    values into the cache on the first iteration.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@[email protected] to comp.arch on Sat May 16 18:09:17 2026
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.
    I am not sure how this would compare with just loading the
    values into the cache on the first iteration.

  • From Thomas Koenig@[email protected] to comp.arch on Sat May 16 18:11:56 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    Thomas Koenig <[email protected]> posted:

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@[email protected] to comp.arch on Sat May 16 22:59:22 2026
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:

    Thomas Koenig <[email protected]> posted:

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).

    How does one {programmer or OS} glean that the bit can be set ??
  • From Thomas Koenig@[email protected] to comp.arch on Sun May 17 07:51:02 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:

    Thomas Koenig <[email protected]> posted:

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    That would prevent thread[k] from allowing thread[j] access to its
    thread local store via shared pointer.

    Not for all of the thread's memory, I was thinking of this as a
    separate flag, to be set only for special purposes (such as above).

    How does one {programmer or OS} glean that the bit can be set ??

    The OS could learn by special argument to mmap(), for example.

    ABIs could specify a second stack for local variables which are
    known, by language rules, not to be accessed by other threads -
    an alloca-version, for example.

    Renaming could then be done relative to that second stack pointer.

    Drawback: This would increase calling overhead.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Sun May 17 18:51:12 2026
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary. It would simply be a faster
    page of memory, with access times closer to cache than DRAM
    and shared by multiple cores (with appropriate software care).


  • From Thomas Koenig@[email protected] to comp.arch on Mon May 18 09:56:04 2026
    From Newsgroup: comp.arch

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary. It would simply be a faster
    page of memory, with access times closer to cache than DRAM
    and shared by multiple cores (with appropriate software care).

    That is of course a possibility.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From MitchAlsup@[email protected] to comp.arch on Mon May 18 17:50:49 2026
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:
    Thomas Koenig <[email protected]> writes:
    Stephen Fuld <[email protected]d> schrieb:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    Maybe one way to implement this would be to treat a special region,
    like local variable addressed in a certain range relative to the
    stack pointer or some other register as something that the CPU
    can treat as if it were a register, and include in its renaming.

    Such a region should probably be page-aligned and sized
    to an integral multiple of the page size.

    Agreed. A "local thread only" flag could then be set in a
    page table.

    A certain portion
    of the virtual address space could be then mapped to, for example,
    a 4KB bank of high-speed SRAM.

    That could compete with cache, and still cause memory traffic.

    The OS can designate that page as 'noncacheable', so no
    coherency traffic necessary.

    The uncacheable page should not show up in any cache; and on most
    machines travels around the system in data-unit-sizes rather than
    cache-line sizes.

    It would simply be a faster
    page of memory, with access times closer to cache than DRAM

    I cannot see how an uncacheable unit of data can approach L1 cache
    latency.

    and shared by multiple cores (with appropriate software care).

    That is of course a possibility.

  • From Stephen Fuld@[email protected] to comp.arch on Mon May 18 11:02:17 2026
    From Newsgroup: comp.arch

    On 5/2/2026 11:46 AM, MitchAlsup wrote:

    big snip


    Thomas Koenig <[email protected]> posted:
    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    Any progress?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@[email protected] to comp.arch on Mon May 18 20:20:54 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/2/2026 11:46 AM, MitchAlsup wrote:

    big snip


    Thomas Koenig <[email protected]> posted:
    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    Any progress?

    A bit.

    To recover from interrupts while performing multi-memory operation*,
    there is a count register (line aligned) in Thread.Header. By using
    this register instead of the Rd supplied by VEC, exceptions and
    interrupts can be recovered--leaving me 5-bits to more fully express
    VEC functionality.

    (*) MM {memory to memory move} and MS {memory set}

    I was thinking of using some of Rd's bits to describe the width of the
    loop in lanes.

    0 would mean "as many as you have"; other numbers would indirectly
    specify a loop recurrence that prevents running wider than Rd used as
    an immediate. Thus, if the compiler finds a recurrence that limits
    width, it is expressed and the HW does not have to go looking
    {simplifying DECODE a bit}.
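
    A sketch of how such a width field might be decoded, assuming the
    simplest reading: the 5-bit field is a direct upper bound on lanes,
    and 0 means the full hardware width. The post only says the limit is
    specified "indirectly", so the real encoding may well differ;
    vec_lanes is a hypothetical name.

```c
/* Hypothetical decode of the VEC width field (assumed to be 5 bits).
 * 0       -> use every lane the implementation has
 * nonzero -> never run wider than the given value (assumed direct cap) */
static unsigned vec_lanes(unsigned hw_lanes, unsigned rd_field)
{
    rd_field &= 0x1F;                      /* keep only the 5-bit field */
    if (rd_field == 0)
        return hw_lanes;                   /* "as many as you have" */
    return rd_field < hw_lanes ? rd_field : hw_lanes;
}
```

    With an 8-lane machine, a field value of 3 would limit the loop to
    3 lanes, while a value of 16 would still run 8 wide.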
  • From Stephen Fuld@[email protected] to comp.arch on Mon May 18 14:18:17 2026
    From Newsgroup: comp.arch

    On 5/14/2026 9:19 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-11 23:11:07] wrote:
    Let me give one possible implementation. There are certainly others. Say
    you have 32 registers. They are "memory mapped" into the first 32 addresses
    of memory. So programs would have to start not at zero, but at 32 (I know
    this can cause other problems - I clearly have not thought through all of
    the details.) So now when the CPU encounters a load (or store) instruction
    where the virtual address is less than 32, it is resolved not by the memory
    system, but by the appropriate register. i.e. if the virtual address was say
    4, the load would be from register R4, not memory location 4. Yes, the
    virtual addressing mechanism would have to be sensitive to whether the
    address was below 32 or not, but that is simple within the CPU. Note that
    the load instruction in this case would not touch the memory system at all,
    so no cache lookups, no TLB lookups, etc.
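
    A minimal sketch of the dispatch described above, assuming a
    word-addressed toy machine. The names machine_t, load_word and
    store_word are hypothetical; the sketch only illustrates that an
    address below 32 selects a register instead of going to the memory
    system, and says nothing about renaming or OoO issue.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_REGS 32                /* register count taken from the post */

/* Hypothetical machine state: a register file plus word-addressed memory. */
typedef struct {
    uint64_t  regs[NUM_REGS];
    uint64_t *mem;                 /* backing memory, indexed by word address */
    size_t    mem_words;           /* caller keeps addresses below this bound */
} machine_t;

/* Load: word addresses below NUM_REGS name a register, not memory. */
static uint64_t load_word(const machine_t *m, uint64_t addr)
{
    if (addr < NUM_REGS)
        return m->regs[addr];      /* no cache or TLB lookup on this path */
    return m->mem[addr];           /* ordinary memory path */
}

/* Store: same address test, opposite direction. */
static void store_word(machine_t *m, uint64_t addr, uint64_t value)
{
    if (addr < NUM_REGS)
        m->regs[addr] = value;
    else
        m->mem[addr] = value;
}
```

    A load from address 4 returns the contents of R4 even if memory
    word 4 holds something else.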

    That solves the problem of encoding an indirect register access as
    a LD/ST instruction, but I highly doubt that's the main problem
    introduced by indirect register access.

    It'd actually be easier to just add a new instruction for indirect
    register access (no need to burden the load/store unit, no need to worry
    about access size and alignment, memory remapping, and whatnot).

    Fair enough. I was motivated by saving an op code. But the confusion
    that has generated has led me to agree with you about using new op
    codes. But a note - I was assuming it wouldn't actually be executed by
    the load/store unit - the use of load/store was "syntactic sugar".


    The implementation problem, AFAIK comes in with OoO: by the time your
    instruction (whether a load or a dedicated instruction) gets to know
    which register it needs to read, we're in the middle of the OoO engine,
    and the first thing it needs to do is to figure out which physical
    register corresponds to this logical register (and it needs to find out
    also if that physical register's value has already been delivered).
    The needed information is definitely out there somewhere in the CPU,
    but I'm not sure it can be made available cheaply at that time&place.

    Good point. I have some ideas about how to do it, but they are not
    cheap. :-(. But if the savings in a common application of VVM is big
    enough it might be worth it. I just don't know.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@[email protected] to comp.arch on Mon May 18 14:22:25 2026
    From Newsgroup: comp.arch

    On 5/14/2026 5:17 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 5/14/2026 3:03 PM, Bernd Linsel wrote:
    On 5/13/26 22:52, MitchAlsup wrote:

    Bernd Linsel <[email protected]> posted:

    On 5/13/26 14:02, Bernd Linsel wrote:
    Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
    accessible in 1 or 2 clocks, and two transfer instructions

    ldqr Rd, <index>
    stqr Rd, <index>

    This should work out perfectly even in a tight VVM loop.


    Should of course read

    ldqr Rd, Rs      // Rs indexes into ultra-fast on-chip SRAM
    stqr Rs1, Rs2    // Rs2 indexes into ultra-fast on-chip SRAM

    I think "direct addressing" with an immediate index instead of via an
    index in a register is not needed.

    How do you access a different register each loop iteration ???
    if you don't have indexing ???


    It's meant as:

    ld Rd, qregs[Rs] and
    st Rs1, qregs[Rs2],

    OK, that solves the indexing issue.

    i.e. the second register is an index into the "quick regs" local SRAM
    bank. Allowing only aligned full-word access should be sufficient, so
    that these are really indices, not addresses.
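
    A minimal software model of the two instructions, assuming a plain
    array stands in for the SRAM bank. The function names mirror the
    mnemonics from the post; wrapping an out-of-range index is an
    assumption of this sketch (real hardware might trap instead), and
    sum_qregs is a hypothetical helper.

```c
#include <stdint.h>

#define QREG_COUNT 512                 /* 4 KB / 8 bytes, per the proposal */

static uint64_t qregs[QREG_COUNT];     /* stands in for the on-core SRAM bank */

/* ldqr Rd, Rs : Rd = qregs[Rs]. Indices are word indices, not byte addresses. */
static uint64_t ldqr(uint64_t rs)
{
    return qregs[rs % QREG_COUNT];     /* wrap on overflow (assumption) */
}

/* stqr Rs1, Rs2 : qregs[Rs2] = Rs1 */
static void stqr(uint64_t rs1, uint64_t rs2)
{
    qregs[rs2 % QREG_COUNT] = rs1;
}

/* A different quick register on every iteration: sum the first n entries. */
static uint64_t sum_qregs(uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++)
        total += ldqr(i);
    return total;
}
```

    Because the index comes from a register, the loop body stays the
    same from one iteration to the next, which is what a VVM loop needs.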

    I must be missing something. Doesn't this quick regs memory have to be
    saved and restored on each context switch? If so, that is very expensive.

    qregs[] is (IS) the actual register file (or files)--so, no added state.

    Huh? In Bernd's post above, he expressly says adding a 4K fast SRAM to
    the core. I don't think he was talking about the register file.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Bernd Linsel@[email protected] to comp.arch on Wed May 20 10:06:24 2026
    From Newsgroup: comp.arch

    On 5/18/26 23:22, Stephen Fuld wrote:
    On 5/14/2026 5:17 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:
    I must be missing something.  Doesn't this quick regs memory have to be
    saved and restored on each context switch?  If so, that is very
    expensive.

    qregs[] is (IS) the actual register file (or files)--so, no added state.

    Huh?  In Bernd's post above, he expressly says adding a 4K fast SRAM to
    the core.  I don't think he was talking about the register file.


    Correct, I meant the "qregs" as additional memory, not as aliases for
    existing registers. This does add a considerable amount of additional
    state. The only way not to thwart quick context switches with state
    that most threads do not need would be to add support for lazy
    save/restore on first access, i.e. an additional status bit "qregs
    valid" that is reset with every context switch, and to trap every
    access to qregs[] while the qregs valid flag is unset.
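
    The lazy scheme can be sketched in software as follows. This is an
    illustration under assumptions, not a description of any real OS or
    core: thread_t, context_switch, qregs_trap, q_load and q_store are
    hypothetical names, and the trap is modelled as an ordinary function
    call.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define QREG_COUNT 512

typedef struct {
    uint64_t saved[QREG_COUNT];   /* per-thread save area */
    bool     has_saved;           /* save area holds this thread's data */
} thread_t;

static uint64_t  qregs[QREG_COUNT];  /* the single hardware bank */
static bool      qregs_valid;        /* the "qregs valid" status bit */
static thread_t *qregs_owner;        /* thread whose data sits in the bank */
static thread_t *current;            /* running thread */
static unsigned  spills;             /* number of bank saves performed */

/* Context switch: only the valid bit is cleared, nothing is copied. */
static void context_switch(thread_t *next)
{
    current = next;
    qregs_valid = false;
}

/* Taken on the first qregs access after a context switch. */
static void qregs_trap(void)
{
    if (qregs_owner != current) {
        if (qregs_owner) {                       /* spill previous owner */
            memcpy(qregs_owner->saved, qregs, sizeof qregs);
            qregs_owner->has_saved = true;
            spills++;
        }
        if (current->has_saved)                  /* restore, or start clean */
            memcpy(qregs, current->saved, sizeof qregs);
        else
            memset(qregs, 0, sizeof qregs);      /* no leak between threads */
        qregs_owner = current;
    }
    qregs_valid = true;
}

static uint64_t q_load(uint64_t idx)
{
    if (!qregs_valid) qregs_trap();
    return qregs[idx % QREG_COUNT];
}

static void q_store(uint64_t value, uint64_t idx)
{
    if (!qregs_valid) qregs_trap();
    qregs[idx % QREG_COUNT] = value;
}
```

    A thread that never touches qregs[] never triggers the trap, so it
    pays nothing beyond the cleared flag on each switch.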

    <s>Another optimization is to keep a score which qregs have been used
    (written) by a thread at all, and to only save these. To mitigate data
    leaking between threads, all never written qregs must return 0 or raise
    an access violation. But this adds again a lot of state to the thread to
    be saved and restored. Furthermore, the necessary access logic delays
    access times and thus foils the original purpose of qregs[].</s>
    --
    Bernd Linsel

  • From Stephen Fuld@[email protected] to comp.arch on Mon May 25 12:00:52 2026
    From Newsgroup: comp.arch

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
             double * restrict c)
    {
        for (int j=0; j<N; j++) {
            for (int k=0; k<N; k++) {
                for (int i=0; i<N; i++) {
                    c[i + j*N] += a[i + k*N] * b[k + j*N];
                }
            }
        }
    }

    C version loop invariant, and cursoring

    #define N 8
    void mm8(double *a, double *b, double *c)
    {
        int i,jN,k,kN;
        double *AcijN,*AbkjN,*AaikN;
        double bN;

        for( jN=0; jN<N*N; jN+=N ) {
            AcijN = &c[jN];            /* &c[0 + j*N] */
            AbkjN = &b[jN];            /* &b[0 + j*N] */
            for( kN=k=0; k<N; k++,kN+=N ) {
                AaikN = &a[kN];        /* &a[0 + k*N] */
                bN = AbkjN[k];         /* b[k + j*N], invariant in the inner loop */
                for( i=0; i<N; i++ ) {
                    AcijN[i] += AaikN[i] * bN;
                }
            }
        }
    }

    I did get this into:

    mm8:
    ; R1 = &a[0];
    ; R2 = &b[0];
    ; R3 = &c[0];
    -------------------------------------------------------------
    MOV RjN,#0 ; R4
    loop1:
    LA RcijN,[Rc,RjN<<3] ; R5
    LA RbkjN,[Rb,RjN<<3] ; R6
    -------------------------------------------------------------
    MOV RkN,#0 ; R7
    MOV Rk,#0 ; R8
    loop2:
    LA RaikN,[Ra,RkN<<3] ; R9
    LDD RbN,[RbkjN,Rk<<3] ; R10
    -------------------------------------------------------------
    MOV Ri,#0 ; R11
    VEC 8,{}
    loop3:
    LDD Ra,[RaikN,Ri<<3] ; R12
    LDD Rc,[RcijN,Ri<<3] ; R13
    FMAC Rc,Ra,RbN,Rc ; R14
    STD Rc,[RcijN,Ri<<3] ;

    LOOP1 LE,Ri,#1,#8 ; R11
    -------------------------------------------------------------
    ADD Rk,Rk,#1 ; R8
    ADD RkN,RkN,#8 ; R7
    CMP Rt,Rk,#8 ; R11
    BLE Rt,loop2
    -------------------------------------------------------------
    ADD RjN,RjN,#8 ; R4
    CMP Rt,RjN,#64 ; R7
    BLE Rt,loop1
    -------------------------------------------------------------
    RET

    without needing any preserved registers.

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
    64 double reads for b. For a and c are 512 reads of doubles each,
    512 doubles are written for c. Total, 1600 memory access for doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.
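
    The counts above can be reproduced by walking the loop nest and
    tallying accesses. This is a small sketch for N = 8, assuming b is
    hoisted out of the innermost loop as described; traffic_t and the
    function names are hypothetical.

```c
#include <stddef.h>

#define N 8

typedef struct { size_t loads_a, loads_b, loads_c, stores_c; } traffic_t;

/* Walk the j/k/i nest and count double-precision accesses,
 * with b[k + j*N] loaded once per (j,k) pair. */
static traffic_t count_scalar_traffic(void)
{
    traffic_t t = {0, 0, 0, 0};
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            t.loads_b++;                 /* invariant in the inner loop */
            for (int i = 0; i < N; i++) {
                t.loads_a++;             /* a[i + k*N] */
                t.loads_c++;             /* c[i + j*N] read */
                t.stores_c++;            /* c[i + j*N] written */
            }
        }
    return t;
}

static size_t total_accesses(traffic_t t)
{
    return t.loads_a + t.loads_b + t.loads_c + t.stores_c;
}

/* Minimum traffic: a, b and c each read once, c written once. */
static size_t minimum_accesses(void)
{
    return 3 * N * N + N * N;
}
```

    For N = 8 this gives 64 + 512 + 512 + 512 = 1600 accesses against a
    minimum of 192 + 64 = 256, i.e. the factor of 6.25.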

    It occurs to me that c[*] should be set to zero for a "real" matrix
    multiply...as is c[*] is both input and output.

    ----------------------------------
    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
             double * restrict c)
    {
        for (int j=0; j<N; j++) {
            double c0 = c[0 + j*N];
            double c1 = c[1 + j*N];
            double c2 = c[2 + j*N];
            double c3 = c[3 + j*N];
            double c4 = c[4 + j*N];
            double c5 = c[5 + j*N];
            double c6 = c[6 + j*N];
            double c7 = c[7 + j*N];
            for (int k=0; k<N; k++) {
                double bk = b[k + j*N];
                c0 += a[0 + k*N] * bk;
                c1 += a[1 + k*N] * bk;
                c2 += a[2 + k*N] * bk;
                c3 += a[3 + k*N] * bk;
                c4 += a[4 + k*N] * bk;
                c5 += a[5 + k*N] * bk;
                c6 += a[6 + k*N] * bk;
                c7 += a[7 + k*N] * bk;
            }
            /* write back c0 to c7 */
            c[0 + j*N] = c0;
            c[1 + j*N] = c1;
            c[2 + j*N] = c2;
            c[3 + j*N] = c3;
            c[4 + j*N] = c4;
            c[5 + j*N] = c5;
            c[6 + j*N] = c6;
            c[7 + j*N] = c7;
        }
    }

    where the loop over k could be vectorized, but that would still
    leave excessive memory traffic for a.
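
    A quick way to convince oneself that the register-blocked form
    computes the same result as the plain j/k/i nest is to run both on
    the same data. In this sketch the local array acc[] stands in for
    the scalars c0..c7, and the names mm8_ref, mm8_blocked and
    mm8_versions_agree are hypothetical. Going by the loop structure,
    the blocked form does 64 loads and 64 stores of c and 64 loads of
    b, but still 512 loads of a.

```c
#define N 8

/* Reference: the original j/k/i nest. */
static void mm8_ref(const double *a, const double *b, double *c)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                c[i + j*N] += a[i + k*N] * b[k + j*N];
}

/* Blocked: one column of c is held in locals across the whole k loop. */
static void mm8_blocked(const double *a, const double *b, double *c)
{
    for (int j = 0; j < N; j++) {
        double acc[N];                               /* stands in for c0..c7 */
        for (int i = 0; i < N; i++)
            acc[i] = c[i + j*N];                     /* 8 loads of c  */
        for (int k = 0; k < N; k++) {
            double bk = b[k + j*N];                  /* 1 load of b   */
            for (int i = 0; i < N; i++)
                acc[i] += a[i + k*N] * bk;           /* 8 loads of a  */
        }
        for (int i = 0; i < N; i++)
            c[i + j*N] = acc[i];                     /* 8 stores of c */
    }
}

/* Returns 1 when both versions agree on small integer-valued inputs,
 * which keeps every intermediate exactly representable. */
static int mm8_versions_agree(void)
{
    double a[N*N], b[N*N], c1[N*N], c2[N*N];
    for (int n = 0; n < N*N; n++) {
        a[n] = n % 5;
        b[n] = n % 3;
        c1[n] = c2[n] = n % 7;
    }
    mm8_ref(a, b, c1);
    mm8_blocked(a, b, c2);
    for (int n = 0; n < N*N; n++)
        if (c1[n] != c2[n])
            return 0;
    return 1;
}
```

    The per-element accumulation order over k is the same in both
    versions, so the results should match exactly, not just
    approximately.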

    ENTER Rc1,Rc8,#0 ; preserve c[1..8]
    MOV RjN,#0 ; R4
    loop1:
    LA Rca,[Rc,RjN<<3] ; &c[1..8]
    LDD Rc1,[Rca,#0] ; R23
    LDD Rc2,[Rca,#8]
    LDD Rc3,[Rca,#16]
    LDD Rc4,[Rca,#24]
    LDD Rc5,[Rca,#32]
    LDD Rc6,[Rca,#40]
    LDD Rc7,[Rca,#48]
    LDD Rc8,[Rca,#56] ; R30

    MOV RkN,#0 ; R5
    ---------------begin vectorize-------------------
    VEC 8,{Rc1..Rc8}
    loop2:
    LDD Rbk,[R2,RjN<<3] ; R6

    LA RakN,[Ra,RkN<<3] ; R7
    LDD Ra1,[RakN,#0] ; R8
    FMAC Rc1,Ra1,Rbk,Rc1 ; R23
    LDD Ra2,[RakN,#8] ; R7
    FMAC Rc2,Ra2,Rbk,Rc2 ; R24
    LDD Ra3,[RakN,#16] ; R7
    FMAC Rc3,Ra3,Rbk,Rc3 ; R25
    LDD Ra4,[RakN,#24] ; R7
    FMAC Rc4,Ra4,Rbk,Rc4 ; R26
    LDD Ra5,[RakN,#32] ; R7
    FMAC Rc5,Ra5,Rbk,Rc5 ; R27
    LDD Ra6,[RakN,#40] ; R7
    FMAC Rc6,Ra6,Rbk,Rc6 ; R28
    LDD Ra7,[RakN,#48] ; R7
    FMAC Rc7,Ra7,Rbk,Rc7 ; R29
    LDD Ra8,[RakN,#56] ; R7
    FMAC Rc8,Ra8,Rbk,Rc8 ; R30

    LOOP1 LE,RkN,#8,#64 ; R4
    ---------------end vectorize-------------------

    ADD RkN,RkN,#8 ; R5
    CMP Rt,RkN,$64 ; R6
    BLE Rt,loop1

    STD Rc1,[Rca,#0]
    STD Rc2,[Rca,#8]
    STD Rc3,[Rca,#16]
    STD Rc4,[Rca,#24]
    STD Rc5,[Rca,#32]
    STD Rc6,[Rca,#40]
    STD Rc7,[Rca,#48]
    STD Rc8,[Rca,#56]

    EXIT Rc1,Rc8,#0
    RET

    46 instructions 19 instructions in vectorized (unrolled) loop.

    c[k] is read once and written once
    b[k] is read 8×
    a[k] is read 8×

    If you are willing to have 64 FMACs in a row; a[k] can be read 2×
    {with very tricky register allocation}.

    Using this many registers causes 64 bytes to be written to stack
    and read back later. Solving the a[k] traffic increases the stack
    footprint to 104 bytes.

    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.
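
    As a rough illustration of what indexing the register file would buy,
    the sketch below copies a into a local array once and indexes that copy
    inside the loop. The array ra is only a stand-in for the hypothetical
    Ra[#] file: a C compiler is free to keep it in memory, so this shows the
    access pattern, not real register allocation. The name mm8_indexed is
    made up for this example.

```c
#define N 8

/* Sketch only: 'ra' models an indexable register file holding all of a. */
void mm8_indexed(const double *restrict a, const double *restrict b,
                 double *restrict c)
{
    double ra[N * N];
    for (int i = 0; i < N * N; i++)
        ra[i] = a[i];                      /* a is read from memory once */

    for (int j = 0; j < N; j++) {
        double cj[N];
        for (int i = 0; i < N; i++)
            cj[i] = c[i + j * N];          /* load column j of c */
        for (int k = 0; k < N; k++) {
            double bk = b[k + j * N];
            for (int i = 0; i < N; i++)
                cj[i] += ra[i + k * N] * bk;   /* indexed "register" read */
        }
        for (int i = 0; i < N; i++)
            c[i + j * N] = cj[i];          /* write column j back */
    }
}
```

    With this shape each element of a is fetched from memory once instead
    of once per column of c.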

    I have been thinking about this and have come up with another potential
    solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 66000, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by
    allowing a sort of register indexing equivalent, I think it could make a
    substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit
    combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now). After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction
    behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target
    address would be eight FMAC instructions, each using a different
    register(s). Of course, I realize that this adds an "extra" instruction
    execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do
    anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU, thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount. It also eliminates any
    "mucking around" with the OoO mechanism to handle not knowing which
    registers are involved at instruction decode time that my previous idea had.

    As I said, I am sure there are issues with this. I welcome your comments.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon May 25 19:36:14 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop instruction has been encountered, you know how many of the executed instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous
    amounts.} vVM only vectorizes the innermost loop.

    (*) Predicated flow control is allowed, but branches/calls/SVC are not.

    It also eliminates any
    "mucking around" with the OoO mechanism to handle not knowing which registers are involved at instruction decode time that my previous
    idea had.

    So does register[indexing] and out-of-line instruction execution.

    As I said, I am sure there are issues with this. I welcome your comments.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon May 25 17:52:51 2026
    From Newsgroup: comp.arch

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential
    solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by
    allowing a sort of register indexing equivalent, I think it could make a
    substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit
    combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    I understand. My rationale for using a different ROB value was to allow
    that functionality to be allowed within a VVM loop, whereas the other
    values invoke a jump or call, which you don't want to allow in VVM. In
    other words an indication that this is a small exception to the no
    control transfer rule and should be allowed. But obviously you
    understand the internals better than I do.





    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction
    behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target
    address would be eight FMAC instructions, each using a different
    register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    I understand. If you want to go there within VVM fine. I was trying to
    avoid that.





    Of course, I realize that this adds an "extra" instruction
    execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do
    anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.



    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous
    amounts. vVM only vectorizes the innermost loop.

    I agree. This would have to be an exception, which is why I thought a
    unique value for ROP could indicate that.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue May 26 17:48:56 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential
    solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by
    allowing a sort of register indexing equivalent, I think it could make a
    substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit
    combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    I understand. My rationale for using a different ROB value was to allow that functionality to be allowed within a VVM loop, whereas the other
    values invoke a jump or call, which you don't want to allow in VVM. In other words an indication that this is a small exception to the no
    control transfer rule and should be allowed. But obviously you
    understand the internals better than I do.





    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction
    behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target
    address would be eight FMAC instructions, each using a different
    register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    I understand. If you want to go there within VVM fine. I was trying to avoid that.





    Of course, I realize that this adds an "extra" instruction
    execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do
    anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.

    Once the loop is in the reservation stations, its instructions stay
    there for the entire execution of the loop, while FETCH remains quiescent.


    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous amounts. vVM only vectorizes the innermost loop.

    I agree. This would have to be an exception, which is why I thought a unique value for ROP could indicate that.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Tue May 26 11:55:11 2026
    From Newsgroup: comp.arch

    On 5/26/2026 10:48 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:


    snip


    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.

    Once the loop is in the reservation stations, they stay there for the
    entire execution of the loop! while FETCH remains quiescent.

    Yes. That is what I thought. This would be an exception to that, in
    that you would have to fetch, and put into the reservation stations, the
    eight instructions pointed to by the TT instruction. It is those eight
    fetches out of the 64 executions of those instructions that led me to
    the (64-8)/64 = 7/8 reduction in the extra fetches that would otherwise
    be required.
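
    The two ways of stating the savings can be checked against each other
    with a few lines of C. This is only arithmetic on the figures quoted in
    the thread (8 distinct target instructions, 64 executions, loop count
    of 8); the function names are invented for the sketch.

```c
#include <assert.h>

/* Fraction of "executed instruction" fetches avoided if each of the
   'distinct' target instructions is fetched once and then reused for
   all 'executions' (here: 8 fetches for 64 executions). */
double fetch_savings(unsigned distinct, unsigned executions)
{
    assert(executions > 0 && distinct <= executions);
    return (double)(executions - distinct) / (double)executions;
}

/* The per-loop form quoted above: Eliminates = 1.0 - 1.0/loop_count. */
double loop_savings(unsigned loop_count)
{
    assert(loop_count > 0);
    return 1.0 - 1.0 / (double)loop_count;
}
```

    For this example both fetch_savings(8, 64) and loop_savings(8) come out
    to 0.875, so the two expressions agree whenever the number of
    executions is the number of distinct targets times the loop count.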
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Wed May 27 10:57:01 2026
    From Newsgroup: comp.arch

    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes. That is what I thought. This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to by the TT instruction. It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal. You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?
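
    A toy C model of that "predication-style case" may help make it
    concrete. It is not My 66000 semantics: ra[] stands for eight separate
    registers, each slot models one shadowed FMAC, and the function name
    and the value of M are assumptions made for the sketch.

```c
#include <assert.h>

#define M 8   /* number of shadowed instructions */

/* Slot i models "FMAC Rc,Ra(i+1),Rbk,Rc"; only the slot selected by rn
   is allowed to execute, all others are predicated out. */
double natpred_fmac(const double ra[M], unsigned rn, double rbk, double rc)
{
    assert(rn < M);
    for (unsigned slot = 0; slot < M; slot++) {
        int enabled = (slot == rn);      /* all but one slot masked off */
        if (enabled)
            rc = ra[slot] * rbk + rc;    /* the one FMAC that executes */
    }
    return rc;
}
```

    The net effect is rc += ra[rn] * rbk, i.e. an indexed register read
    expressed as a choice among M fixed-register instructions.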


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Wed May 27 11:13:26 2026
    From Newsgroup: comp.arch

    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes. That is what I thought. This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to by the TT instruction. It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal. You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction). But once the instructions are loaded into the
    reservation stations, the result is exactly as you describe. And, of
    course, you don't have to "skip over" the instructions not executed, you
    just use the M value to choose which of the instructions to execute.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 28 07:43:47 2026
    From Newsgroup: comp.arch

    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to by the TT instruction.  It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style
    `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction).  But once the instructions are loaded into the
    reservation stations, the result is exactly as you describe.  And, of
    course, you don't have to "skip over" the instructions not executed, you
    just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.
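
    One way to picture such a third field is as a selector for how the
    enable mask over the M shadowed instructions is built. The sketch below
    is purely illustrative: the mode names and encodings are invented, and
    it assumes "up to N" includes instruction N, which the thread does not
    specify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encodings for the suggested third field. */
enum natpred_mode { NP_ONLY, NP_UPTO, NP_EXCEPT };

/* Returns a bit mask over the m shadowed instructions; bit i set means
   instruction i executes. */
uint32_t natpred_mask(enum natpred_mode mode, unsigned n, unsigned m)
{
    assert(m >= 1 && m <= 32 && n < m);
    uint32_t all = (m == 32) ? UINT32_MAX : ((UINT32_C(1) << m) - 1);
    uint32_t one = UINT32_C(1) << n;
    switch (mode) {
    case NP_ONLY:   return one;              /* just instruction n */
    case NP_UPTO:   return (one << 1) - 1;   /* instructions 0..n */
    case NP_EXCEPT: return all & ~one;       /* everything but n */
    }
    return 0;
}
```

    With m = 8 and n = 3 this gives 0x08, 0x0F and 0xF7 for the three
    modes respectively.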

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in
    other functions. Mitch? Thomas?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Thu May 28 07:53:59 2026
    From Newsgroup: comp.arch

    On 5/28/2026 7:43 AM, Stephen Fuld wrote:
    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to by the TT instruction.  It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style
    `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow
    what you call the NATPRED instruction (which is actually a modified TT
    instruction).  But once the instructions are loaded into the
    reservation stations, the result is exactly as you describe.  And, of
    course, you don't have to "skip over" the instructions not executed,
    you just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code.  This allows the functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception

    Sorry, "extension", not "exception"


    would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc.  I
    don't know how useful this sort of thing would be.

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in other functions.  Mitch? Thomas?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu May 28 18:08:09 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to by the TT instruction.  It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style
    `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction).  But once the instructions are loaded into the
    reservation stations, the result is exactly as you describe.  And, of
    course, you don't have to "skip over" the instructions not executed, you
    just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction {when used in vVM loop} produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Thu May 28 19:29:11 2026
    From Newsgroup: comp.arch

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction). But once the instructions are loaded into the
    reservation stations, the result is exactly as you describe. And, of
    course, you don't have to "skip over" the instructions not executed, you
    just use the M value to choose which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up placed in
    the "dataflow" core once and for all, after which they are just triggered
    several times until the loop exit condition is satisfied.

    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    But instead of skipping, you can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri May 29 08:50:17 2026
    From Newsgroup: comp.arch

    On 5/28/2026 11:08 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip


    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your
    understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the
    "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction (when used in a vVM loop) produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Yes. But the proposed enhancement to my original proposal gives you
    another way to specify which instructions get executed. Your original
    PRED has a fixed mask to choose which instructions are executed or not,
    based on a binary condition test. This enhancement allows choosing
    which instruction gets executed in each lane based on the value in a
    register. Thus lane 1 could execute instruction 3, lane 2 could execute
    instruction 5, etc. This is what allows you to gain the effect of
    "indexed register accesses" at low cost (I believe one extra cycle).

    My original proposal allows you to execute one instruction based on the
    value in a register, i.e. if the register contains the value 3, then the
    third instruction is the only one executed. The enhanced version allows
    more flexibility. For example, you could specify that all instructions up
    to the number in the register be executed. As I said, while the use case
    for the basic instruction is clear (emulating register indexing), I am not
    sure there are any use cases for the enhancement.
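
    A minimal sketch of the selection rule described above, written in Python
    purely as a model of the idea and not as a definition of the instruction.
    The function names, the `mode` strings ("only", "up_to", "except") and the
    1-based instruction positions are assumptions made for illustration; the
    thread only says that the register value picks the instruction position,
    and that a third field might change the sense of the test.

```python
def natpred_mask(m, n, mode="only"):
    """Return the enable mask over n predicated instructions for one lane.

    m    -- value read from the lane's register (1-based position, assumed)
    n    -- number of predicated instructions following the NATPRED
    mode -- hypothetical "sense" field: "only", "up_to", or "except"
    """
    positions = range(1, n + 1)
    if mode == "only":      # basic proposal: execute instruction m only
        return [pos == m for pos in positions]
    if mode == "up_to":     # enhancement: execute instructions 1..m
        return [pos <= m for pos in positions]
    if mode == "except":    # enhancement: execute everything but m
        return [pos != m for pos in positions]
    raise ValueError(f"unknown mode: {mode}")


def lane_masks(lane_values, n, mode="only"):
    """Each lane computes its own mask from its own register value."""
    return [natpred_mask(m, n, mode) for m in lane_values]
```

    With `lane_masks([3, 5], 8)`, the first lane enables only position 3 and
    the second lane only position 5, which is the "lane 1 could execute
    instruction 3, lane 2 could execute instruction 5" case.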
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@[email protected] to comp.arch on Fri May 29 09:03:25 2026
    From Newsgroup: comp.arch

    On 5/28/2026 4:29 PM, Stefan Monnier wrote:
    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction). But once the instructions are loaded into the reservation
    stations, the result is exactly as you describe. And, of course, you don't
    have to "skip over" the instructions not executed; you just use the M
    value to choose which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up placed in
    the "dataflow" core once and for all, after which they are just triggered
    several times until the loop exit condition is satisfied.

    Yes. Note that in an earlier post, I changed my proposal to be more in
    line with your ideas and the original pred - specifically, the
    instructions to be conditionally executed are physically placed inline,
    after the NATPRED.



    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    My exposition was sloppy. :-(


    But instead of skipping, you can "predicate away" the undesirable
    instructions.

    Agree. And clearer exposition.

    So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    Yes, size is clearly a limitation. I think it works for MATMUL (8), but
    probably not for MATMUL (16). I don't know enough to know how practical
    it would be to allow more than the current 32 instructions within a VVM
    loop. As for performance, I expect (hope) that instructions other than
    the one whose position matches the register value will not be executed.
    If that can be done, then the extra cost is presumably one cycle, and it
    saves (in the MATMUL (8) example) executing eight load instructions.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@[email protected] to comp.arch on Fri May 29 16:55:38 2026
    From Newsgroup: comp.arch


    Stefan Monnier <[email protected]> posted:

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT
    instruction). But once the instructions are loaded into the reservation
    stations, the result is exactly as you describe. And, of course, you don't
    have to "skip over" the instructions not executed; you just use the M
    value to choose which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up placed in
    the "dataflow" core once and for all, after which they are just triggered
    several times until the loop exit condition is satisfied.

    Correct.

    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    Also Correct.

    But instead of skipping, you can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    If the predicated instructions are used in at least 1 iteration, they
    are not useless.


    === Stefan
  • From MitchAlsup@[email protected] to comp.arch on Fri May 29 16:58:42 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/28/2026 11:08 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip


    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your
    understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the
    "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction (when used in a vVM loop) produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Yes. But the proposed enhancement to my original proposal gives you
    another way to specify which instructions get executed. Your original
    PRED has a fixed mask to choose which instructions are executed or not,
    based on a binary condition test. This enhancement allows choosing
    which instruction gets executed in each lane based on the value in a
    register. Thus lane 1 could execute instruction 3, lane 2 could execute
    instruction 5, etc. This is what allows you to gain the effect of
    "indexed register accesses" at low cost (I believe one extra cycle).

    It is the latency of PARSE+DECODE.

    My original proposal allows you to execute one instruction based on the
    value in a register, i.e. if the register contains the value 3, then the
    third instruction is the only one executed. The enhanced version allows
    more flexibility. For example, you could specify that all instructions up
    to the number in the register be executed. As I said, while the use case
    for the basic instruction is clear (emulating register indexing), I am not
    sure there are any use cases for the enhancement.

    Not sure what the source code would look like in order for the compiler
    to recognize this pattern and optimize to your solution.



  • From MitchAlsup@[email protected] to comp.arch on Fri May 29 17:04:03 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 5/28/2026 4:29 PM, Stefan Monnier wrote:
    ----------------------
    So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    Yes, size is clearly a limitation.

    A 6-wide × 16-deep execution window would allow between 90 and 96
    instructions.

    I think it works for MATMUL (8), but probably not for MATMUL (16).

    You run out of registers anyway.

    I don't know enough to know how practical
    it would be to allow more than the current 32 instructions within a VVM loop.

    It makes vVM harder for smaller machines.

    Oh and BTW, forward FFT is 27 instructions/butterfly iteration.

    As for performance, I expect (hope) that instructions other than
    the one whose position matches the register value will not be executed.
    If that can be done, then the extra cost is presumably one cycle, and it
    saves (in the MATMUL (8) example) executing eight load instructions.


  • From Stephen Fuld@[email protected] to comp.arch on Mon Jun 1 15:45:37 2026
    From Newsgroup: comp.arch

    On 5/29/2026 9:58 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    My original proposal allows you to execute one instruction based on the
    value in a register, i.e. if the register contains the value 3, then the
    third instruction is the only one executed. The enhanced version allows
    more flexibility. For example, you could specify that all instructions
    up to the number in the register be executed. As I said, while the use
    case for the basic instruction is clear, emulating register indexing, I
    am not sure there are any use cases for the enhancement.

    Not sure what the source code would look like in order for the compiler
    to recognize this pattern and optimize to your solution.

    Good question. I have thought about it for a while, and though I am far
    from a compiler expert, I have come up with a potential solution at
    least for the basic proposal.

    The idea is that if the compiler sees a SWITCH statement where each clause
    that is switched to will compile to one instruction (not counting any
    instructions needed for array addressing, which would be handled outside
    the SWITCH, e.g. a loop counter), then it could emit the NATPRED (or
    whatever name is better) followed by the single instruction for each
    clause. I am sure this needs more specificity, but I hope you get the idea.
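
    As a rough illustration of the source shape in question, here is a sketch
    in Python, with an if/elif chain standing in for the SWITCH. Everything in
    it is hypothetical: the function names, the four "registers" b1..b4, the
    multiply-add used as the single instruction per clause, and the 1-based
    selector k are assumptions chosen to resemble the MATMUL case. The second
    function models what the predicated form would compute, with every clause
    present and only the one whose position matches k taking effect.

```python
def fma_switch(acc, a, b_regs, k):
    """Source shape: each clause is a single multiply-add on a different
    'register', selected by the 1-based value k."""
    b1, b2, b3, b4 = b_regs
    if k == 1:
        acc += a * b1
    elif k == 2:
        acc += a * b2
    elif k == 3:
        acc += a * b3
    elif k == 4:
        acc += a * b4
    return acc


def fma_predicated(acc, a, b_regs, k):
    """Predicated shape: all clauses sit inline after the selector; only the
    clause whose position equals k is enabled, the rest are predicated away."""
    for position, b in enumerate(b_regs, start=1):
        enabled = (position == k)
        if enabled:
            acc += a * b
    return acc
```

    Both functions return the same value for any k from 1 to 4, which is the
    equivalence a compiler would be relying on when it replaces the SWITCH
    with the predicated sequence.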

    I am still not sure of the benefit of the "enhanced" instruction, and
    haven't come up with any reasonable source code that would benefit from it.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stefan Monnier@[email protected] to comp.arch on Mon Jun 1 14:58:35 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-05-29 16:55:38] wrote:
    Stefan Monnier <[email protected]> posted:
    But instead of skipping, you can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).
    If the predicated instructions are used in at least 1 iteration, they
    are not useless.

    They may not be useless overall, but they still waste resources at each
    iteration where they're not used. Traditional predication of an `if`
    gives a "50% waste" (for equal size branches or when each branch is
    taken as often as the other), whereas a predicated `switch` results in
    a waste of `(N-1)/N`. As N grows larger this becomes discouraging.
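
    To put numbers on that fraction, read as (N-1)/N, here is a small Python
    sketch. It assumes exactly one of the N predicated instructions is used
    per iteration and that every instruction occupies the same amount of
    resources; `wasted_fraction` is a name made up for this example.

```python
from fractions import Fraction


def wasted_fraction(n):
    """Fraction of the n predicated slots left unused in one iteration,
    assuming exactly one slot is used and all slots cost the same."""
    return Fraction(n - 1, n)


# Print the fraction for a two-way `if` and for larger switches.
for n in (2, 4, 8, 16):
    print(n, wasted_fraction(n), float(wasted_fraction(n)))
```

    Under those assumptions N = 2 gives 1/2, matching the "50% waste" of an
    `if`, while N = 8 gives 7/8 and N = 16 gives 15/16.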


    === Stefan
  • From Stephen Fuld@[email protected] to comp.arch on Mon Jun 1 23:40:38 2026
    From Newsgroup: comp.arch

    On 6/1/2026 11:58 AM, Stefan Monnier wrote:
    MitchAlsup [2026-05-29 16:55:38] wrote:
    Stefan Monnier <[email protected]> posted:
    But instead of skipping, you can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).
    If the predicated instructions are used in at least 1 iteration, they
    are not useless.

    They may not be useless overall, but they still waste resources at each
    iteration where they're not used. Traditional predication of an `if`
    gives a "50% waste" (for equal size branches or when each branch is
    taken as often as the other), whereas a predicated `switch` results in
    a waste of `(N-1)/N`. As N grows larger this becomes discouraging.

    OK, but what exactly are we wasting? We are taking space in the
    reservation stations, but we are saving executing actual load
    instructions. So the "waste" results in faster execution.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)