Forum: War Ensemble BBS

Re: Matmul in VVM

From scott@[email protected] (Scott Lurndal) to comp.arch on Fri May 15 22:30:26 2026

From Newsgroup: comp.arch

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size. A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@[email protected] to comp.arch on Sat May 16 10:22:34 2026

From Newsgroup: comp.arch

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.

That could compete with cache, and still cause memory traffic.
I am not sure how this would compare with just loading the
values into the cache on the first iteration.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From MitchAlsup@[email protected] to comp.arch on Sat May 16 18:09:17 2026

From Newsgroup: comp.arch

Thomas Koenig <[email protected]> posted:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.

A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.

That could compete with cache, and still cause memory traffic.
I am not sure how this would compare with just loading the
values into the cache on the first iteration.


From Thomas Koenig@[email protected] to comp.arch on Sat May 16 18:11:56 2026

From Newsgroup: comp.arch

MitchAlsup <[email protected]d> schrieb:

Thomas Koenig <[email protected]> posted:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.

Not for all of the thread's memory; I was thinking of this as a
separate flag, to be set only for special purposes (such as above).
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From MitchAlsup@[email protected] to comp.arch on Sat May 16 22:59:22 2026

From Newsgroup: comp.arch

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

Thomas Koenig <[email protected]> posted:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.

Not for all of the thread's memory; I was thinking of this as a
separate flag, to be set only for special purposes (such as above).

How does one {programmer or OS} glean that the bit can be set ??

From Thomas Koenig@[email protected] to comp.arch on Sun May 17 07:51:02 2026

From Newsgroup: comp.arch

MitchAlsup <[email protected]d> schrieb:

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

Thomas Koenig <[email protected]> posted:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.

Not for all of the thread's memory; I was thinking of this as a
separate flag, to be set only for special purposes (such as above).

How does one {programmer or OS} glean that the bit can be set ??

The OS could learn by special argument to mmap(), for example.
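
As a rough sketch of what "a special argument to mmap()" might look like from user space: the flag name MAP_THREAD_LOCAL_HINT below is purely hypothetical (no such flag exists in POSIX or Linux) and is defined as 0 here, so the code behaves like an ordinary anonymous private mapping. The helper name map_thread_local_region is likewise made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical flag: NOT part of POSIX or Linux. Defined as 0 so that this
   sketch degrades to a plain anonymous private mapping. */
#define MAP_THREAD_LOCAL_HINT 0

/* Map `pages` whole pages that only the calling thread promises to touch.
   The region is page-aligned and a multiple of the page size, as suggested
   in the thread. Returns NULL on failure. */
void *map_thread_local_region(size_t pages, size_t *length_out)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    size_t length = pages * (size_t)page_size;
    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_THREAD_LOCAL_HINT,
                   -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *length_out = length;
    return p;
}
```

A real implementation would need kernel support to turn such a flag into the page-table bit discussed above; nothing here does that.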

ABIs could specify a second stack for local variables which are
known, by language rules, not to be accessed by other threads -
an alloca-version, for example.

Renaming could then be done relative to that second stack pointer.

Drawback: This would increase calling overhead.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From scott@[email protected] (Scott Lurndal) to comp.arch on Sun May 17 18:51:12 2026

From Newsgroup: comp.arch

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.

That could compete with cache, and still cause memory traffic.

The OS can designate that page as 'noncacheable', so no
coherency traffic necessary. It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).


From Thomas Koenig@[email protected] to comp.arch on Mon May 18 09:56:04 2026

From Newsgroup: comp.arch

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.

That could compete with cache, and still cause memory traffic.

The OS can designate that page as 'noncacheable', so no
coherency traffic necessary. It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).

That is of course a possibility.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From MitchAlsup@[email protected] to comp.arch on Mon May 18 17:50:49 2026

From Newsgroup: comp.arch

Thomas Koenig <[email protected]> posted:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Thomas Koenig <[email protected]> writes:

Stephen Fuld <[email protected]d> schrieb:

On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:

On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.

Doesn’t this defeat the point of how registers are supposed to work?

No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.

Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.

Such a region should probably be page-aligned and sized
to an integral multiple of the page size.

Agreed. A "local thread only" flag could then be set in a
page table.

A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.

That could compete with cache, and still cause memory traffic.

The OS can designate that page as 'noncacheable', so no
coherency traffic necessary.

The uncacheable page should not show up in any cache; and on most
machines travels around the system in data-unit-sizes rather than
cache-line sizes.

It would simply be a faster

page of memory, with access times closer to cache than DRAM

I cannot see how an uncacheable unit of data can approach L1 cache
latency.

and shared by multiple cores (with appropriate software care).

That is of course a possibility.


From Stephen Fuld@[email protected] to comp.arch on Mon May 18 11:02:17 2026

From Newsgroup: comp.arch

On 5/2/2026 11:46 AM, MitchAlsup wrote:

big snip

Thomas Koenig <[email protected]> posted:

One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).

The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.

With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.

Plus, the setup time for VVM...

I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.

Any progress?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup@[email protected] to comp.arch on Mon May 18 20:20:54 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/2/2026 11:46 AM, MitchAlsup wrote:

big snip

Thomas Koenig <[email protected]> posted:

One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).

The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.

With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.

Plus, the setup time for VVM...

I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.

Any progress?

A bit.

To recover from interrupts while performing multi-memory operation*,
there is a count register (line aligned) in Thread.Header. By using
this register instead of the Rd supplied by VEC, exceptions and
interrupts can be recovered--leaving me 5-bits to more fully express
VEC functionality.

(*) MM {memory to memory move} and MS {memory set}

I was thinking of using some of Rd's bits to describe the width of the
loop in lanes.

A value of 0 would mean "as many as you have"; other values would
indirectly specify a loop recurrence that prevents running wider than
Rd used as an immediate. Thus, if the compiler finds a recurrence that
limits the width, it is expressed in the instruction and the HW does not
have to go looking {simplifying DECODE a bit}.
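
One way to picture the width field is the small sketch below. It rests on assumptions not stated in the post: that the 5-bit field holds the maximum lane count directly (the post only says the limit is specified "indirectly"), and the names effective_lanes and VEC_WIDTH_FIELD_MAX are invented for illustration.

```c
#include <assert.h>

/* Assumed encoding: a 5-bit field, where 0 means "no compiler-imposed limit". */
#define VEC_WIDTH_FIELD_MAX 31u

/* Return how many lanes the loop may run wide, given the lanes the hardware
   has and the width field the compiler placed in VEC. */
unsigned effective_lanes(unsigned hw_lanes, unsigned width_field)
{
    assert(hw_lanes > 0);
    assert(width_field <= VEC_WIDTH_FIELD_MAX);

    if (width_field == 0)
        return hw_lanes;              /* "as many as you have" */

    /* A recurrence found by the compiler caps the width. */
    return width_field < hw_lanes ? width_field : hw_lanes;
}
```

With this reading, an 8-lane machine given a field of 3 would run 3 wide, and a field of 16 would still run 8 wide.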

From Stephen Fuld@[email protected] to comp.arch on Mon May 18 14:18:17 2026

From Newsgroup: comp.arch

On 5/14/2026 9:19 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-11 23:11:07] wrote:

Let me give one possible implementation. There are certainly others. Say
you have 32 registers. They are "memory mapped" into the first 32 addresses
of memory. So programs would have to start not at zero, but at 32 (I know
this can cause other problems - I clearly have not thought through all of
the details.) So now when the CPU encounters a load (or store) instruction
where the virtual address is less than 32, it is resolved not by the memory
system, but by the appropriate register. i.e. if the virtual address was say
4, the load would be from register R4, not memory location 4. Yes, the
virtual addressing mechanism would have to be sensitive to whether the
address was below 32 or not, but that is simple within the CPU. Note that
the load instruction in this case would not touch the memory system at all,
so no cache lookups, no TLB lookups, etc.

That solves the problem of encoding an indirect register access as
a LD/ST instruction, but I highly doubt that's the main problem
introduced by indirect register access.

It'd actually be easier to just add a new instruction for indirect
register access (no need to burden the load/store unit, no need to worry
about access size and alignment, memory remapping, and whatnot).

Fair enough. I was motivated by saving an op code. But the confusion
that this has generated has led me to agree with you about using new op
codes. But a note - I was assuming it wouldn't actually be executed by
the load/store unit - the use of load/store was "syntactic sugar".

The implementation problem, AFAIK, comes in with OoO: by the time your
instruction (whether a load or a dedicated instruction) gets to know
which register it needs to read, we're in the middle of the OoO engine,
and the first thing it needs to do is to figure out which physical
register corresponds to this logical register (and it needs to find out
also if that physical register's value has already been delivered).
The needed information is definitely out there somewhere in the CPU,
but I'm not sure it can be made available cheaply at that time&place.

Good point. I have some ideas about how to do it, but they are not
cheap. :-(. But if the savings in a common application of VVM is big
enough it might be worth it. I just don't know.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Stephen Fuld@[email protected] to comp.arch on Mon May 18 14:22:25 2026

From Newsgroup: comp.arch

On 5/14/2026 5:17 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 5/14/2026 3:03 PM, Bernd Linsel wrote:

On 5/13/26 22:52, MitchAlsup wrote:

Bernd Linsel <[email protected]> posted:

On 5/13/26 14:02, Bernd Linsel wrote:

Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions

ldqr Rd, <index>
stqr Rd, <index>

This should work out perfectly even in a tight VVM loop.

Should of course read

ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM

I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.

How do you access a different register each loop iteration ???
if you don't have indexing ???

It's meant as:

ld Rd, qregs[Rs] and
st Rs1, qregs[Rs2],

OK, that solves the indexing issue.

i.e. the second register as index into the "quick regs" local SRAM bank.
Only aligned full-word access should be sufficient, so that
these are really indices, not addresses.
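
A minimal C model of the two proposed transfers, assuming a 512-entry bank of 64-bit words. The type QRegs and the function sum_qregs are illustrative names only, and treating an out-of-range index as an assertion failure is an assumption; the post does not say what the hardware would do.

```c
#include <assert.h>
#include <stdint.h>

#define QREG_COUNT 512u   /* 4KB of SRAM / 8 bytes per entry */

typedef struct {
    uint64_t q[QREG_COUNT];
} QRegs;

/* ldqr Rd, Rs : Rd = qregs[Rs], with Rs used as an index, not an address. */
uint64_t ldqr(const QRegs *qr, uint64_t index)
{
    assert(index < QREG_COUNT);
    return qr->q[index];
}

/* stqr Rs1, Rs2 : qregs[Rs2] = Rs1. */
void stqr(QRegs *qr, uint64_t value, uint64_t index)
{
    assert(index < QREG_COUNT);
    qr->q[index] = value;
}

/* Because the index lives in a register, a loop can walk the bank. */
uint64_t sum_qregs(const QRegs *qr, uint64_t first, uint64_t count)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < count; i++)
        sum += ldqr(qr, first + i);
    return sum;
}
```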

I must be missing something. Doesn't this quick regs memory have to be
saved and restored on each context switch? If so, that is very expensive.

qregs[] is (IS) the actual register file (or files)--so, no added state.

Huh? In Bernd's post above, he expressly says adding a 4K fast SRAM to
the core. I don't think he was talking about the register file.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Bernd Linsel@[email protected] to comp.arch on Wed May 20 10:06:24 2026

From Newsgroup: comp.arch

On 5/18/26 23:22, Stephen Fuld wrote:

On 5/14/2026 5:17 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

I must be missing something. Doesn't this quick regs memory have to be
saved and restored on each context switch? If so, that is very
expensive.

qregs[] is (IS) the actual register file (or files)--so, no added state.

Huh? In Bernd's post above, he expressly says adding a 4K fast SRAM to
the core. I don't think he was talking about the register file.

Correct, I meant the "qregs" as additional memory, not as aliases for
existing registers. This does add a considerable amount of additional
state. To avoid slowing down context switches with state that most
threads never use, one would have to add support for lazy save/restore
on first access, i.e. an additional status bit "qregs valid" that is
reset with every context switch, with every access to qregs[] trapping
while the qregs valid flag is unset.
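
The lazy scheme can be sketched as a small software model. Everything here is an assumption made for illustration: the Core and Thread types, the "owner" bookkeeping, and zero-filling the bank for a thread that has never used it are not taken from the post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QREG_COUNT 512u

typedef struct {
    uint64_t saved[QREG_COUNT];   /* save area, used only if needed */
    bool has_saved;               /* true once this thread's bank was saved */
} Thread;

typedef struct {
    uint64_t q[QREG_COUNT];   /* the physical qregs bank */
    bool valid;               /* "qregs valid" bit, cleared on every switch */
    Thread *owner;            /* thread whose data currently sits in q[] */
    Thread *current;          /* running thread */
    unsigned traps;           /* number of first-access traps taken */
} Core;

/* A context switch only clears the valid bit; no qregs data is moved. */
void context_switch(Core *c, Thread *next)
{
    c->current = next;
    c->valid = false;
}

/* Trap handler for the first qregs access after a switch. */
static void qregs_trap(Core *c)
{
    c->traps++;
    if (c->owner != c->current) {
        if (c->owner != NULL) {
            memcpy(c->owner->saved, c->q, sizeof c->q);
            c->owner->has_saved = true;
        }
        if (c->current->has_saved)
            memcpy(c->q, c->current->saved, sizeof c->q);
        else
            memset(c->q, 0, sizeof c->q);   /* do not leak another thread's data */
        c->owner = c->current;
    }
    c->valid = true;
}

uint64_t qreg_read(Core *c, unsigned index)
{
    assert(index < QREG_COUNT && c->current != NULL);
    if (!c->valid)
        qregs_trap(c);
    return c->q[index];
}

void qreg_write(Core *c, unsigned index, uint64_t value)
{
    assert(index < QREG_COUNT && c->current != NULL);
    if (!c->valid)
        qregs_trap(c);
    c->q[index] = value;
}
```

In this model a thread that never touches qregs[] costs nothing at switch time, and the full 4KB copy happens only when a different thread actually uses the bank.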

<s>Another optimization is to keep a score which qregs have been used
(written) by a thread at all, and to only save these. To mitigate data
leaking between threads, all never written qregs must return 0 or raise
an access violation. But this adds again a lot of state to the thread to
be saved and restored. Furthermore, the necessary access logic delays
access times and thus foils the original purpose of qregs[].</s>
--
Bernd Linsel


From Stephen Fuld@[email protected] to comp.arch on Mon May 25 12:00:52 2026

From Newsgroup: comp.arch

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

#define N 8
void mm8(double * const restrict a, double * const restrict b,
         double * restrict c)
{
    for (int j=0; j<N; j++) {
        for (int k=0; k<N; k++) {
            for (int i=0; i<N; i++) {
                c[i + j*N] += a[i + k*N] * b[k + j*N];
            }
        }
    }
}

C version loop invariant, and cursoring

#define N 8
void mm8(double *a, double *b, double *c)
{
    int i,jN,k,kN;
    double *AcijN,*AbkjN,*AaikN;
    double bN;

    for( jN=0; jN<N*N; jN+=N ) {
        AcijN = &c[jN];
        AbkjN = &b[jN];
        for( kN=k=0; k<N; k++,kN+=N ) {
            AaikN = &a[kN];
            bN = AbkjN[k];
            for( i=0; i<N; i++ ) {
                AcijN[i] += AaikN[i] * bN;
            }
        }
    }
}

I did get this into:

mm8:
; R1 = &a[0];
; R2 = &b[0];
; R3 = &c[0];
-------------------------------------------------------------
    MOV RjN,#0 ; R4
loop1:
    LA RcijN,[Rc,RjN<<3] ; R5
    LA RbkjN,[Rb,RjN<<3] ; R6
-------------------------------------------------------------
    MOV RkN,#0 ; R7
    MOV Rk,#0 ; R8
loop2:
    LA RaikN,[Ra,RkN<<3] ; R9
    LDD RbN,[RbkjN,Rk<<3] ; R10
-------------------------------------------------------------
    MOV Ri,#0 ; R11
    VEC 8,{}
loop3:
    LDD Ra,[RaikN,Ri<<3] ; R12
    LDD Rc,[RcijN,Ri<<3] ; R13
    FMAC Rc,Ra,RbN,Rc ; R14
    STD Rc,[RcijN,Ri<<3] ;

    LOOP1 LE,Ri,#1,#8 ; R11
-------------------------------------------------------------
    ADD Rk,Rk,#1 ; R8
    ADD RkN,RkN,#8 ; R7
    CMP Rt,Rk,#8 ; R11
    BLE Rt,loop2
-------------------------------------------------------------
    ADD RjN,RjN,#8 ; R4
    CMP Rt,RjN,#64 ; R7
    BLE Rt,loop1
-------------------------------------------------------------
    RET

without needing any preserved registers.

b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
64 double reads for b. For a and c there are 512 reads of doubles each,
and 512 doubles are written for c. Total: 1600 memory accesses for doubles.

By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
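
The counts above can be reproduced mechanically by walking the same loop nest and tallying accesses instead of doing arithmetic. The Traffic struct and function names are invented for this sketch; the SIMD figure of 256 is simply taken from the post.

```c
#include <assert.h>

typedef struct {
    unsigned a_reads;
    unsigned b_reads;
    unsigned c_reads;
    unsigned c_writes;
} Traffic;

/* Tally memory accesses of the scalar loop nest with b[k + j*N] hoisted
   out of the innermost loop. */
Traffic count_scalar_traffic(unsigned n)
{
    assert(n > 0);
    Traffic t = {0, 0, 0, 0};
    for (unsigned j = 0; j < n; j++) {
        for (unsigned k = 0; k < n; k++) {
            t.b_reads++;                 /* one b load per (j,k) */
            for (unsigned i = 0; i < n; i++) {
                t.a_reads++;             /* a[i + k*N] */
                t.c_reads++;             /* c[i + j*N] loaded */
                t.c_writes++;            /* c[i + j*N] stored */
            }
        }
    }
    return t;
}

unsigned traffic_total(Traffic t)
{
    return t.a_reads + t.b_reads + t.c_reads + t.c_writes;
}
```

For n = 8 this gives 64 + 512 + 512 + 512 = 1600 accesses, and 1600 / 256 = 6.25.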

It occurs to me that c[*] should be set to zero for a "real" matrix
multiply...as is, c[*] is both input and output.

----------------------------------

#define N 8
void mm8(double * const restrict a, double * const restrict b,
         double * restrict c)
{
    for (int j=0; j<N; j++) {
        double c0 = c[0 + j*N];
        double c1 = c[1 + j*N];
        double c2 = c[2 + j*N];
        double c3 = c[3 + j*N];
        double c4 = c[4 + j*N];
        double c5 = c[5 + j*N];
        double c6 = c[6 + j*N];
        double c7 = c[7 + j*N];
        for (int k=0; k<N; k++) {
            double bk = b[k + j*N];
            c0 += a[0 + k*N] * bk;
            c1 += a[1 + k*N] * bk;
            c2 += a[2 + k*N] * bk;
            c3 += a[3 + k*N] * bk;
            c4 += a[4 + k*N] * bk;
            c5 += a[5 + k*N] * bk;
            c6 += a[6 + k*N] * bk;
            c7 += a[7 + k*N] * bk;
        }
        /* write back c0 to c7 */
    }
}

where the loop over k could be vectorized, but that would still
leave excessive memory traffic for a.
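
For anyone who wants to check that the register-accumulating form computes the same result as the plain triple loop, here is a compact sketch. It uses a local array acc[] in place of c0..c7 (an assumption: a compiler may or may not keep it in registers), and the names mm8_ref and mm8_acc are made up.

```c
#include <assert.h>

#define N 8

/* Reference: the straightforward triple loop from the thread. */
void mm8_ref(const double *restrict a, const double *restrict b,
             double *restrict c)
{
    assert(a && b && c);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                c[i + j*N] += a[i + k*N] * b[k + j*N];
}

/* Accumulating variant: one column of c is read once, updated in local
   storage across all k, then written back once. */
void mm8_acc(const double *restrict a, const double *restrict b,
             double *restrict c)
{
    assert(a && b && c);
    for (int j = 0; j < N; j++) {
        double acc[N];
        for (int i = 0; i < N; i++)
            acc[i] = c[i + j*N];
        for (int k = 0; k < N; k++) {
            double bk = b[k + j*N];
            for (int i = 0; i < N; i++)
                acc[i] += a[i + k*N] * bk;
        }
        for (int i = 0; i < N; i++)     /* write back c0 to c7 */
            c[i + j*N] = acc[i];
    }
}
```

Both versions add the products for each element in the same order of k, so with small integer-valued inputs the results should match exactly.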

ENTER Rc1,Rc8,#0 ; preserve c[1..8]
MOV RjN,#0 ; R4
loop1:
LA Rca,[Rc,RjN<<3] ; &c[1..8]
LDD Rc1,[Rca,#0] ; R23
LDD Rc2,[Rca,#8]
LDD Rc3,[Rca,#16]
LDD Rc4,[Rca,#24]
LDD Rc5,[Rca,#32]
LDD Rc6,[Rca,#40]
LDD Rc7,[Rca,#48]
LDD Rc8,[Rca,#56] ; R30

MOV RkN,#0 ; R5
---------------begin vectorize-------------------
VEC 8,{Rc1..Rc8}
loop2:
LDD Rbk,[R2,RjN<<3] ; R6

LA RakN,[Ra,RkN<<3] ; R7
LDD Ra1,[RakN,#0] ; R8
FMAC Rc1,Ra1,Rbk,Rc1 ; R23
LDD Ra2,[RakN,#8] ; R7
FMAC Rc2,Ra2,Rbk,Rc2 ; R24
LDD Ra3,[RakN,#16] ; R7
FMAC Rc3,Ra3,Rbk,Rc3 ; R25
LDD Ra4,[RakN,#24] ; R7
FMAC Rc4,Ra4,Rbk,Rc4 ; R26
LDD Ra5,[RakN,#32] ; R7
FMAC Rc5,Ra5,Rbk,Rc5 ; R27
LDD Ra6,[RakN,#40] ; R7
FMAC Rc6,Ra6,Rbk,Rc6 ; R28
LDD Ra7,[RakN,#48] ; R7
FMAC Rc7,Ra7,Rbk,Rc7 ; R29
LDD Ra8,[RakN,#56] ; R7
FMAC Rc8,Ra8,Rbk,Rc8 ; R30

LOOP1 LE,RkN,#8,#64 ; R4
---------------end vectorize-------------------

ADD RkN,RkN,#8 ; R5
CMP Rt,RkN,#64 ; R6
BLE Rt,loop1

STD Rc1,[Rca,#0]
STD Rc2,[Rca,#8]
STD Rc3,[Rca,#16]
STD Rc4,[Rca,#24]
STD Rc5,[Rca,#32]
STD Rc6,[Rca,#40]
STD Rc7,[Rca,#48]
STD Rc8,[Rca,#56]

EXIT Rc1,Rc8,#0
RET

46 instructions; 19 instructions in the vectorized (unrolled) loop.

c[k] is read once and written once
b[k] is read 8×
a[k] is read 8×

If you are willing to have 64 FMACs in a row; a[k] can be read 2×
{with very tricky register allocation}.

Using this many registers causes 64 bytes to be written to stack
and read back later. Solving the a[k] traffic increases the stack
footprint to 104 bytes.

The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.

I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 66000, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.

The first is to enhance the TT instruction, using the currently unused bit
combination of the BOB field to indicate "execute the instruction
pointed to by the displacement plus the contents of SRC1" (similarly to
how the values 00 and 11 do now). After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s). Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU, thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount. It also eliminates any
"mucking around" with the OoO mechanism to handle not knowing which
registers are involved at instruction decode time that my previous idea had.

As I said, I am sure there are issues with this. I welcome your comments.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Mon May 25 19:36:14 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

----------------------------

I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop instruction has been encountered, you know how many of the executed instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous
amounts.} vVM only vectorizes the innermost loop.

(*) Predicated flow control is allowed, but branches/calls/SVC are not.

It also eliminates any
"mucking around" with the OoO mechanism to handle not knowing which registers are involved at instruction decode time that my previous
idea had.

So does register[indexing] and out-of-line instruction execution.

As I said, I am sure there are issues with this. I welcome your comments.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Mon May 25 17:52:51 2026

From Newsgroup: comp.arch

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

----------------------------

I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

I understand. My rationale for using a different ROB value was to permit
that functionality within a VVM loop, whereas the other
values invoke a jump or call, which you don't want to allow in VVM. In
other words an indication that this is a small exception to the no
control transfer rule and should be allowed. But obviously you
understand the internals better than I do.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

I understand. If you want to go there within VVM fine. I was trying to
avoid that.

Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous
amounts. vVM only vectorizes the innermost loop.

I agree. This would have to be an exception, which is why I thought a
unique value for ROP could indicate that.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Tue May 26 17:48:56 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

Thomas Koenig <[email protected]> posted:

MitchAlsup <[email protected]d> schrieb:

----------------------------

I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 6600, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.

First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.

A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.

I understand. My rationale for using a different ROB value was to allow that functionality to be allowed within a VVM loop, whereas the other
values invoke a jump or call, which you don't want to allow in VVM. In other words an indication that this is a small exception to the no
control transfer rule and should be allowed. But obviously you
understand the internals better than I do.

After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).

This is why a CTT is better, you can perform multiple instructions before returning.

I understand. If you want to go there within VVM fine. I was trying to avoid that.

Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.

So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

Once the loop is in the reservation stations, they stay there for the
entire execution of the loop! while FETCH remains quiescent.

So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.

The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous amounts. vVM only vectorizes the innermost loop.

I agree. This would have to be an exception, which is why I thought a unique value for ROP could indicate that.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Tue May 26 11:55:11 2026

From Newsgroup: comp.arch

On 5/26/2026 10:48 AM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 5/25/2026 12:36 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 5/3/2026 3:28 PM, MitchAlsup wrote:

snip

A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.

thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.

Eliminates = 1.0-1.0/(loop_count)

??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.

Once the loop is in the reservation stations, they stay there for the
entire execution of the loop! while FETCH remains quiescent.

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations, the
eight instructions pointed to by the TT instruction. It is those eight
fetches out of the 64 executions of those instructions that led me to
the (64-8)/64 = 7/8 reduction in the extra fetches that would otherwise
be required.
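
The arithmetic above can be checked with a small sketch. This is only a model of the counting argument, not of any hardware: the function names are made up for illustration, and it assumes each of the eight target instructions is fetched exactly once and then reused from the reservation stations.

```python
from fractions import Fraction


def fetch_reduction(distinct_targets: int, total_executions: int) -> Fraction:
    """Fraction of target-instruction fetches avoided when each distinct
    target is fetched once and every later execution reuses that copy."""
    return Fraction(total_executions - distinct_targets, total_executions)


def eliminated_per_loop(loop_count: int) -> Fraction:
    """The 1.0 - 1.0/loop_count form quoted earlier in the thread."""
    return 1 - Fraction(1, loop_count)


# Eight target instructions, 64 executions in total.
print(fetch_reduction(8, 64))         # 7/8
print(float(fetch_reduction(8, 64)))  # 0.875
# With a loop count of 8 the other formula gives the same figure.
print(eliminated_per_loop(8))         # 7/8
```

For this example the two ways of counting agree; they would differ if the number of distinct targets were not equal to the number of executions per target.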
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@[email protected] to comp.arch on Wed May 27 10:57:01 2026

From Newsgroup: comp.arch

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?
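
A rough software model of that idea, purely as a sketch: `natpred_mask`, `make_fmac` and `run_shadow` are hypothetical names, the selection is assumed to be 1-based (a value of 3 enables the third shadowed instruction), and an out-of-range value is assumed to enable nothing. None of this is a defined My 66000 behavior.

```python
def natpred_mask(rn_value: int, m: int) -> list[bool]:
    """Enable mask for the M shadowed instructions: exactly one slot is
    enabled, chosen by the register value (assumed 1-based)."""
    return [slot == rn_value for slot in range(1, m + 1)]


def make_fmac(i: int):
    """Build a stand-in for an FMAC that names 'register' a[i] directly."""
    def fmac(regs: dict) -> None:
        regs["acc"] += regs["a"][i] * regs["b"]
    return fmac


def run_shadow(rn_value: int, shadow: list, regs: dict) -> dict:
    """All shadowed instructions are present; only the enabled one fires."""
    mask = natpred_mask(rn_value, len(shadow))
    for enabled, instr in zip(mask, shadow):
        if enabled:
            instr(regs)
    return regs


regs = {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], "b": 2.0, "acc": 0.0}
shadow = [make_fmac(i) for i in range(8)]
run_shadow(3, shadow, regs)
print(regs["acc"])  # 6.0, i.e. the third FMAC used a[2] = 3.0
```

The point of the sketch is that every FMAC names its register statically, so the "indexing" comes only from which instruction is enabled.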

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Wed May 27 11:13:26 2026

From Newsgroup: comp.arch

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out (i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what
you call the NATPRED instruction (which is actually a modified TT
instruction). But once the instructions are loaded into the
reservation stations, the result is exactly as you describe. And, of
course, you don't have to "skip over" the instructions not executed, you
just use the M value to choose which of the instructions to execute.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Thu May 28 07:43:47 2026

From Newsgroup: comp.arch

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what
you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the
reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.
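
To make the third-field idea concrete, here is a sketch of how the mask might be derived for each sense of the test. The mode names ("only", "up_to", "all_except") and the function name are invented for illustration, and slot numbering is assumed to be 1-based; no such encoding exists.

```python
def natpred_mask_ext(rn_value: int, m: int, mode: str) -> list[bool]:
    """Enable mask over M shadowed instructions for a given test sense.

    mode is a hypothetical third field:
      "only"       - execute just instruction N
      "up_to"      - execute instructions 1..N
      "all_except" - execute every instruction but N
    """
    slots = range(1, m + 1)
    if mode == "only":
        return [s == rn_value for s in slots]
    if mode == "up_to":
        return [s <= rn_value for s in slots]
    if mode == "all_except":
        return [s != rn_value for s in slots]
    raise ValueError(f"unknown mode: {mode}")


for mode in ("only", "up_to", "all_except"):
    print(mode, natpred_mask_ext(3, 5, mode))
```

All three senses reduce to a comparison of the slot number against the register value, which is why they might share one mechanism.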

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in
other functions. Mitch? Thomas?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Thu May 28 07:53:59 2026

From Newsgroup: comp.arch

On 5/28/2026 7:43 AM, Stephen Fuld wrote:

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow
what you call the NATPRED instruction (which is actually a modified TT
instruction). But once the instructions are loaded into the
reservations stations, the result is exactly as you describe. And, of
course, you don't have to "skip over" the instructions not executed,
you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception

Sorry, "extension", not "exception"

would be to allow a

third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Thu May 28 18:08:09 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/27/2026 11:13 AM, Stephen Fuld wrote:

On 5/27/2026 7:57 AM, Stefan Monnier wrote:

Stephen Fuld [2026-05-26 11:55:11] wrote:

Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to be the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.

Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the
reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in a vVM loop) produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Thank you Stefan for helping me see this.

So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@[email protected] to comp.arch on Thu May 28 19:29:11 2026

From Newsgroup: comp.arch

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose
which of the instructions to execute.

In vVM, all the instructions that make up the loop end up placed in
the "dataflow" core once and for all, after which they are just triggered
several times until the loop exit condition is satisfied.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

But instead of skipping, you can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Fri May 29 08:50:17 2026

From Newsgroup: comp.arch

On 5/28/2026 11:08 AM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

snip

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the
"jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in vVM loop} produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Yes. But the proposed enhancement to my original proposal gives you
another way to specify which instructions get executed. Your original
PRED has a fixed mask to choose which instructions are executed or not,
based on a binary condition test. This enhancement allows choosing
which instruction gets executed in each lane based on the value in a
register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
"indexed register accesses" at low cost (I believe one extra cycle.)

My original proposal allows you to execute one instruction based on the
value in a register, i.e. if the register contains the value 3, then the
third instruction is the only one executed. The enhanced version allows
more flexibility. For example, you could specify executing all
instructions up to the number in the register. As I said, while the use
case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Fri May 29 09:03:25 2026

From Newsgroup: comp.arch

On 5/28/2026 4:29 PM, Stefan Monnier wrote:

Yes, with the exception that they don't have to immediately follow what you
call the NATPRED instruction (which is actually a modified TT instruction).
But once the instructions are loaded into the reservations stations, the
result is exactly as you describe. And, of course, you don't have to "skip
over" the instructions not executed, you just use the M value to choose
which of the instructions to execute.

In vVM, all the instructions that make up the loop end up all placed in
the "dataflow" core once forall after which and they are just triggered several times until the loop exit condition is satisfied.

Yes. Note that in an earlier post, I changed my proposal to be more in
line with your ideas and the original pred - specifically, the
instructions to be conditionally executed are physically placed inline,
after the NATPRED.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

My exposition was sloppy. :-(

But instead of skipping, You can "predicate away" the undesirable instructions.

Agree. And clearer exposition.

So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

Yes, size is clearly a limitation. I think it works for MATMUL (8), but probably not for MATMUL (16). I don't know enough to know how practical
it would be to allow more than the current 32 instructions within a VVM
loop. As for performance, I expect (hope) that instructions other than
the one whose position matches the register value will not be executed.
If that can be done, then the extra cost is presumably one cycle, and it
saves (in the MATMUL (8) example) executing eight load instructions.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri May 29 16:55:38 2026

From Newsgroup: comp.arch

Stefan Monnier <[email protected]> posted:

Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservation stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

In vVM, all the instructions that make up the loop end up placed in
the "dataflow" core once and for all, after which they are just triggered
several times until the loop exit condition is satisfied.

Correct.

So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).

Also Correct.

But instead of skipping, you can "predicate away" the undesirable instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

=== Stefan

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri May 29 16:58:42 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/28/2026 11:08 AM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

snip

Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the
"jump" characteristic which required an exception to the no jump within
a VVM loop rule.

The PRED instruction (when used in a vVM loop) produces a lane mask so
that different iterations of the loop are executed from the same
starting time.

And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.

That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.

Yes. But the proposed enhancement to my original proposal gives you
another way to specify which instructions get executed. Your original
PRED has a fixed mask to choose which instructions are executed or not, based on a binary condition test. This enhancement allows choosing
which instruction gets executed in each lane based on the value in a register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
"indexed register accesses" at low cost (I believe one extra cycle.)

It is the latency of PARSE+DECODE.

My original proposal allows you to execute one instruction based on the value in a register, i.e. if the register contains the value 3, then the third instruction is the only one executed. The enhanced version allows more flexibility. For example, you could specify that all instructions up to the number in the register be executed. As I said, while the use case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.

Not sure what the source code would look like in order for the compiler
to recognize this pattern and optimize to your solution.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@[email protected] to comp.arch on Fri May 29 17:04:03 2026

From Newsgroup: comp.arch

Stephen Fuld <[email protected]d> posted:

On 5/28/2026 4:29 PM, Stefan Monnier wrote:

----------------------

So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

Yes, size is clearly a limitation.

A 6-wide × 16-deep execution window would allow between 90 and 96 instructions.

I think it works for MATMUL (8), but probably not for MATMUL (16).

You run out of registers anyway.

I don't know enough to know how practical
it would be to allow more than the current 32 instructions within a VVM loop.

It makes vVM harder for smaller machines.

Oh and BTW, forward FFT is 27 instructions/butterfly iteration.

As for performance, I expect (hope) that instructions other than
the one whose position matches the register value will not be executed.
If that can be done, then the extra cost is presumably one cycle, and it saves (in the MATMUL (8) example) executing eight load instructions.

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Mon Jun 1 15:45:37 2026

From Newsgroup: comp.arch

On 5/29/2026 9:58 AM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

snip

My original proposal allows you to execute one instruction based on the
value in a register, i.e. if the register contains the value 3, then the
third instruction is the only one executed. The enhanced version allows
more flexibility. For example, you could specify that all
instructions up to the number in the register be executed. As I said, while the use
case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.

Not sure what the source code would look like in order for the compiler
to recognize this pattern and optimize to your solution.

Good question. I have thought about it for a while, and though I am far
from a compiler expert, I have come up with a potential solution at
least for the basic proposal.

The idea is that if the compiler sees a SWITCH statement where each clause
that is switched to will compile to one instruction (not counting any
instructions needed for array addressing, which would be handled outside
the SWITCH, e.g. a loop counter), then it could emit the NATPRED (or
whatever name is better) followed by the single instruction for each
clause. I am sure this needs more specificity, but I hope you get the idea.
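
As a rough illustration of that recognition step, here is a Python sketch of the check a compiler pass might make. Everything in it is an assumption for the sake of the example: the function name `lower_switch`, the textual form `NATPRED <selector>, <count>`, and the use of the 32-instruction loop limit as a budget are not taken from any real compiler or from the My 66000 definition.

```python
MAX_VVM_LOOP_INSTRUCTIONS = 32  # current loop limit mentioned in this thread


def lower_switch(selector_reg, clauses, budget=MAX_VVM_LOOP_INSTRUCTIONS):
    """Return a NATPRED sequence, or None to fall back to normal codegen.

    clauses: list of instruction lists, one list per case, in case order.
    The pattern qualifies only if every clause is exactly one instruction
    and NATPRED plus the clauses fit in the given instruction budget.
    """
    if not clauses or any(len(c) != 1 for c in clauses):
        return None
    if 1 + len(clauses) > budget:
        return None
    # Hypothetical textual form: NATPRED <selector>, <count>
    return [f"NATPRED {selector_reg}, {len(clauses)}"] + [c[0] for c in clauses]
```

A SWITCH with eight one-instruction clauses would lower to nine instructions (the NATPRED plus the eight clauses). Any clause needing two or more instructions makes the function return None, so the ordinary code path is used instead.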

I am still not sure of the benefit of the "enhanced" instruction, and
haven't come up with any reasonable source code that would benefit from it.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@[email protected] to comp.arch on Mon Jun 1 14:58:35 2026

From Newsgroup: comp.arch

MitchAlsup [2026-05-29 16:55:38] wrote:

Stefan Monnier <[email protected]> posted:

But instead of skipping, you can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
gives a "50% waste" (for equal size branches or when each branch is
taken as often as the other), whereas a predicated `switch` results in
a waste of `(N-1)/N`. As N grows larger this becomes discouraging.
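
To put numbers on that, here is a small Python sketch. It assumes exactly one of N equal-size alternatives is selected per iteration, so the unused share of predicated slots is (N-1)/N. The function name is made up for illustration.

```python
from fractions import Fraction


def predication_waste(n):
    """Fraction of predicated slots doing no useful work in one iteration,
    assuming exactly one of n equal-size alternatives is selected."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Fraction(n - 1, n)


for n in (2, 4, 8, 16):
    print(n, predication_waste(n), float(predication_waste(n)))
```

For N = 2 this gives the 50% of an ordinary `if`; for N = 8 it is 7/8, and for N = 16 it is 15/16.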

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@[email protected] to comp.arch on Mon Jun 1 23:40:38 2026

From Newsgroup: comp.arch

On 6/1/2026 11:58 AM, Stefan Monnier wrote:

MitchAlsup [2026-05-29 16:55:38] wrote:

Stefan Monnier <[email protected]> posted:

But instead of skipping, you can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).

If the predicated instructions are used in at least 1 iteration, they
are not useless.

They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
gives a "50% waste" (for equal size branches or when each branch is
taken as often as the other), whereas a predicated `switch` results in
a waste of `(N-1)/N`. As N grows larger this becomes discouraging.

OK, but what exactly are we wasting? We are taking space in the
reservation stations, but we are saving executing actual load
instructions. So the "waste" results in faster execution.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2
