Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
I am not sure how this would compare with just loading the
values into the cache on the first iteration.
Thomas Koenig <[email protected]> posted:
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
MitchAlsup <[email protected]d> schrieb:
Thomas Koenig <[email protected]> posted:
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
Not for all of the thread's memory; I was thinking of this as a
separate flag, to be set only for special purposes (such as above).
Thomas Koenig <[email protected]> posted:
MitchAlsup <[email protected]d> schrieb:
Thomas Koenig <[email protected]> posted:
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
That would prevent thread[k] from allowing thread[j] access to its
thread local store via shared pointer.
Not for all of the thread's memory; I was thinking of this as a
separate flag, to be set only for special purposes (such as above).
How does one {programmer or OS} glean that the bit can be set?
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
The OS can designate that page as 'noncacheable', so no
coherency traffic necessary. It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Thomas Koenig <[email protected]> writes:
Stephen Fuld <[email protected]d> schrieb:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
Maybe one way to implement this would be to treat a special region,
like local variable addressed in a certain range relative to the
stack pointer or some other register as something that the CPU
can treat as if it were a register, and include in its renaming.
Such a region should probably be page-aligned and sized
to an integral multiple of the page size.
Agreed. A "local thread only" flag could then be set in a
page table.
A certain portion
of the virtual address space could be then mapped to, for example,
a 4KB bank of high-speed SRAM.
That could compete with cache, and still cause memory traffic.
The OS can designate that page as 'noncacheable', so no
coherency traffic necessary.
It would simply be a faster
page of memory, with access times closer to cache than DRAM
and shared by multiple cores (with appropriate software care).
That is of course a possibility.
Thomas Koenig <[email protected]> posted:
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
On 5/2/2026 11:46 AM, MitchAlsup wrote:
big snip
Thomas Koenig <[email protected]> posted:
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
Any progress?
Stephen Fuld [2026-05-11 23:11:07] wrote:
Let me give one possible implementation. There are certainly others. Say
you have 32 registers. They are "memory mapped" into the first 32 addresses
of memory. So programs would have to start not at zero, but at 32 (I know
this can cause other problems - I clearly have not thought through all of
the details.) So now when the CPU encounters a load (or store) instruction
where the virtual address is less than 32, it is resolved not by the memory
system, but by the appropriate register. i.e. if the virtual address was say
4, the load would be from register R4, not memory location 4. Yes, the
virtual addressing mechanism would have to be sensitive to whether the
address was below 32 or not, but that is simple within the CPU. Note that
the load instruction in this case would not touch the memory system at all,
so no cache lookups, no TLB lookups, etc.
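As a rough illustration of the address check described above, here is a minimal C sketch of a toy machine in which addresses below 32 resolve to the register file rather than to memory. The names toy_cpu, toy_load and toy_store, the word-addressed memory, and its size are all assumptions made for the example; nothing here reflects a real ISA.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS  32    /* 32 registers, as in the post */
#define MEM_WORDS 1024  /* hypothetical size of the toy, word-addressed memory */

typedef struct {
    uint64_t regs[NUM_REGS];  /* architectural register file */
    uint64_t mem[MEM_WORDS];  /* backing memory; words 0..31 are shadowed */
} toy_cpu;

/* Load: an address below NUM_REGS is served by the register file and
   never reaches the memory system (no cache or TLB lookup modeled). */
static uint64_t toy_load(const toy_cpu *cpu, size_t addr)
{
    if (addr < NUM_REGS)
        return cpu->regs[addr];
    return addr < MEM_WORDS ? cpu->mem[addr] : 0;  /* out of range reads 0 */
}

/* Store: same routing decision as the load. */
static void toy_store(toy_cpu *cpu, size_t addr, uint64_t value)
{
    if (addr < NUM_REGS)
        cpu->regs[addr] = value;
    else if (addr < MEM_WORDS)
        cpu->mem[addr] = value;
}
```

With this routing, toy_load(&cpu, 4) returns the contents of R4 even if memory word 4 holds something else, which is the behavior the post describes.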
That solves the problem of encoding an indirect register access as
a LD/ST instruction, but I highly doubt that's the main problem
introduced by indirect register access.
It'd actually be easier to just add a new instruction for indirect
register access (no need to burden the load/store unit, no need to worry
about access size and alignment, memory remapping, and whatnot).
The implementation problem, AFAIK comes in with OoO: by the time your
instruction (whether a load or a dedicated instruction) gets to know
which register it needs to read, we're in the middle of the OoO engine,
and the first thing it needs to do is to figure out which physical
register corresponds to this logical register (and it needs to find out
also if that physical register's value has already been delivered).
The needed information is definitely out there somewhere in the CPU,
but I'm not sure it can be made available cheaply at that time&place.
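To make the renaming concern a little more concrete, here is a toy C model of a register alias table. It is only a sketch under simplifying assumptions: the names rat, rat_init, rat_rename_dest and rat_lookup are hypothetical, physical registers are never freed, and a whole-table copy stands in for whatever checkpointing a real core would use.

```c
#define NUM_LOGICAL 32  /* assumption: 32 architectural registers */

/* Hypothetical register alias table: logical register -> physical register. */
typedef struct {
    int map[NUM_LOGICAL];
    int next_free;  /* next unused physical register number */
} rat;

/* Start with the identity mapping; physical registers 32 and up are free. */
static void rat_init(rat *t)
{
    for (int i = 0; i < NUM_LOGICAL; i++)
        t->map[i] = i;
    t->next_free = NUM_LOGICAL;
}

/* Rename a destination at decode time: allocate a fresh physical register. */
static int rat_rename_dest(rat *t, int logical)
{
    t->map[logical] = t->next_free++;
    return t->map[logical];
}

/* Source lookup: which physical register holds `logical` in this table? */
static int rat_lookup(const rat *t, int logical)
{
    return t->map[logical];
}
```

Suppose an indirect read is decoded, a copy of the table is taken at that point, and a younger instruction then renames R4. If the index later turns out to be 4, looking it up in the copy still gives the old physical register, while the live table already points at the younger instruction's register. Getting the older mapping at that time and place is the part that may not be cheap.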
Stephen Fuld <[email protected]d> posted:
On 5/14/2026 3:03 PM, Bernd Linsel wrote:
On 5/13/26 22:52, MitchAlsup wrote:
Bernd Linsel <[email protected]> posted:
On 5/13/26 14:02, Bernd Linsel wrote:
Add e.g. 4KB (= 512 64bit "registers") ultra-fast SRAM to the core,
accessible in 1 or 2 clocks, and two transfer instructions
ldqr Rd, <index>
stqr Rd, <index>
This should work out perfectly even in a tight vVM loop.
Should of course read
ldqr Rd, Rs // Rs indexes into ultra-fast on-chip SRAM
stqr Rs1, Rs2 // Rs2 indexes into ultra-fast on-chip SRAM
I think "direct addressing" with an immediate index instead of via an
index in a register is not needed.
How do you access a different register each loop iteration ???
if you don't have indexing ???
It's meant as:
ld Rd, qregs[Rd] and
st Rs1, qregs[Rs2],
OK, that solves the indexing issue.
i.e. the second register as index into the "quick regs" local SRAM bank,
Only aligned full word access possible should be sufficient, so that
these are really indices, not addresses.
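A small C sketch of the two transfer instructions as described, treating the second operand as a word index into the SRAM bank. The struct core_state, the count of 32 ordinary registers, and the wrap-around of out-of-range indices are assumptions for illustration only; the posts do not say what happens to an index of 512 or more.

```c
#include <stdint.h>

#define QREG_COUNT 512  /* 4 KB / 8 bytes per 64-bit entry */

typedef struct {
    uint64_t r[32];              /* ordinary registers (count assumed) */
    uint64_t qregs[QREG_COUNT];  /* the proposed fast on-chip SRAM bank */
} core_state;

/* ldqr Rd, Rs : Rd <- qregs[Rs]; Rs holds an index, not a byte address.
   Assumption: indices wrap modulo QREG_COUNT. */
static void ldqr(core_state *c, unsigned rd, unsigned rs)
{
    c->r[rd] = c->qregs[c->r[rs] % QREG_COUNT];
}

/* stqr Rs1, Rs2 : qregs[Rs2] <- Rs1, with the same index convention. */
static void stqr(core_state *c, unsigned rs1, unsigned rs2)
{
    c->qregs[c->r[rs2] % QREG_COUNT] = c->r[rs1];
}
```

Because the operand is an index of whole 64-bit words, there is no alignment or access-size case to handle, which is the point made above about indices versus addresses.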
I must be missing something. Doesn't this quick regs memory have to be
saved and restored on each context switch? If so, that is very expensive.
qregs[] is (IS) the actual register file (or files)--so, no added state.
On 5/14/2026 5:17 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
I must be missing something. Doesn't this quick regs memory have to be
saved and restored on each context switch? If so, that is very
expensive.
qregs[] is (IS) the actual register file (or files)--so, no added state.
Huh? In Bernd's post above, he expressly says adding a 4K fast SRAM to
the core. I don't think he was talking about the register file.
Thomas Koenig <[email protected]> posted:
MitchAlsup <[email protected]d> schrieb:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
C version loop invariant, and cursoring
#define N 8
void mm8(double *a, double *b, double *c)
{
int i,jN,k,kN;
double *AcijN,*AbkjN,*AaikN;
double bN;
for( jN=0; jN<N*N; jN+=N ) {
AcijN = &c[jN];
AbkjN = &b[jN];
for( kN=k=0; k<N; k++,kN+=N ) {
AaikN = &a[kN];
bN = AbkjN[k];
for( i=0; i<N; i++ ) {
AcijN[i] += AaikN[i] * bN;
}
}
}
}
I did get this into:
mm8:
; R1 = &a[0];
; R2 = &b[0];
; R3 = &c[0];
-------------------------------------------------------------
MOV RjN,#0 ; R4
loop1:
LA RcijN,[Rc,RjN<<3] ; R5
LA RbkjN,[Rb,RjN<<3] ; R6
-------------------------------------------------------------
MOV RkN,#0 ; R7
MOV Rk,#0 ; R8
loop2:
LA RaikN,[Ra,RkN<<3] ; R9
LDD RbN,[RbkjN,Rk<<3] ; R10
-------------------------------------------------------------
MOV Ri,#0 ; R11
VEC 8,{}
loop3:
LDD Ra,[RaikN,Ri<<3] ; R12
LDD Rc,[RcijN,Ri<<3] ; R13
FMAC Rc,Ra,RbN,Rc ; R14
STD Rc,[RcijN,Ri<<3] ;
LOOP1 LE,Ri,#1,#8 ; R11
-------------------------------------------------------------
ADD Rk,Rk,#1 ; R8
ADD RkN,RkN,#8 ; R7
CMP Rt,Rk,#8 ; R11
BLE Rt,loop2
-------------------------------------------------------------
ADD RjN,RjN,#8 ; R4
CMP Rt,RjN,#64 ; R7
BLE Rt,loop1
-------------------------------------------------------------
RET
without needing any preserved registers.
b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
64 double reads for b. For a and c are 512 reads of doubles each,
512 doubles are written for c. Total, 1600 memory access for doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
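The counts above can be reproduced with a few lines of C. This is only a bookkeeping sketch: the struct traffic and the function names are made up for the example, and it counts double-precision accesses in the loop nest with b hoisted out of the innermost loop, not anything measured on hardware.

```c
/* Counts of double-precision loads and stores. */
typedef struct {
    long loads;
    long stores;
} traffic;

/* Scalar loop nest with b[k + j*N] hoisted out of the innermost loop. */
static traffic scalar_traffic(long n)
{
    traffic t;
    t.loads  = n * n        /* b: once per (j,k) pair        */
             + n * n * n    /* a: once per inner iteration   */
             + n * n * n;   /* c: once per inner iteration   */
    t.stores = n * n * n;   /* c: written every inner iteration */
    return t;
}

/* Lower bound: each of a, b, c read once, c written once. */
static traffic minimal_traffic(long n)
{
    traffic t;
    t.loads  = 3 * n * n;
    t.stores = n * n;
    return t;
}
```

For n = 8 the first function gives 1088 loads and 512 stores (1600 in total) and the second gives 192 loads and 64 stores (256 in total), which is the factor of 6.25 quoted above.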
It occurs to me that c[*] should be set to zero for a "real" matrix
multiply...as is c[*] is both input and output.
----------------------------------
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
double c0 = c[0 + j*N];
double c1 = c[1 + j*N];
double c2 = c[2 + j*N];
double c3 = c[3 + j*N];
double c4 = c[4 + j*N];
double c5 = c[5 + j*N];
double c6 = c[6 + j*N];
double c7 = c[7 + j*N];
for (int k=0; k<N; k++) {
double bk = b[k + j*N];
c0 += a[0 + k*N] * bk;
c1 += a[1 + k*N] * bk;
c2 += a[2 + k*N] * bk;
c3 += a[3 + k*N] * bk;
c4 += a[4 + k*N] * bk;
c5 += a[5 + k*N] * bk;
c6 += a[6 + k*N] * bk;
c7 += a[7 + k*N] * bk;
}
/* write back c0 to c7 */
}
}
where the loop over k could be vectorized, but that would still
leave excessive memory traffic for a.
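For anyone who wants to check the transformation, here is a self-contained C sketch that compares the plain triple loop against a variant that keeps one column of c in locals. The local array acc[] stands in for c0..c7, and the names mm8_ref and mm8_blocked are invented for this example; it is one possible completion, not necessarily the code the poster had in mind.

```c
#define N 8

/* Reference: the plain triple loop. */
static void mm8_ref(const double *a, const double *b, double *c)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                c[i + j*N] += a[i + k*N] * b[k + j*N];
}

/* Variant: column j of c is read once, accumulated locally, written once. */
static void mm8_blocked(const double *a, const double *b, double *c)
{
    for (int j = 0; j < N; j++) {
        double acc[N];  /* stands in for c0..c7 */
        for (int i = 0; i < N; i++)
            acc[i] = c[i + j*N];
        for (int k = 0; k < N; k++) {
            double bk = b[k + j*N];  /* invariant over i */
            for (int i = 0; i < N; i++)
                acc[i] += a[i + k*N] * bk;
        }
        for (int i = 0; i < N; i++)  /* the write-back step */
            c[i + j*N] = acc[i];
    }
}
```

Both functions add the products for each element of c in the same k order, so with small integer-valued inputs the results should match exactly. The loads of a[] are still repeated for every j, which is the remaining traffic discussed above.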
ENTER Rc1,Rc8,#0 ; preserve c[1..8]
MOV RjN,#0 ; R4
loop1:
LA Rca,[Rc,RjN<<3] ; &c[1..8]
LDD Rc1,[Rca,#0] ; R23
LDD Rc2,[Rca,#8]
LDD Rc3,[Rca,#16]
LDD Rc4,[Rca,#24]
LDD Rc5,[Rca,#32]
LDD Rc6,[Rca,#40]
LDD Rc7,[Rca,#48]
LDD Rc8,[Rca,#56] ; R30
MOV RkN,#0 ; R5
---------------begin vectorize-------------------
VEC 8,{Rc1..Rc8}
loop2:
LDD Rbk,[R2,RjN<<3] ; R6
LA RakN,[Ra,RkN<<3] ; R7
LDD Ra1,[RakN,#0] ; R8
FMAC Rc1,Ra1,Rbk,Rc1 ; R23
LDD Ra2,[RakN,#8] ; R7
FMAC Rc2,Ra2,Rbk,Rc2 ; R24
LDD Ra3,[RakN,#16] ; R7
FMAC Rc3,Ra3,Rbk,Rc3 ; R25
LDD Ra4,[RakN,#24] ; R7
FMAC Rc4,Ra4,Rbk,Rc4 ; R26
LDD Ra5,[RakN,#32] ; R7
FMAC Rc5,Ra5,Rbk,Rc5 ; R27
LDD Ra6,[RakN,#40] ; R7
FMAC Rc6,Ra6,Rbk,Rc6 ; R28
LDD Ra7,[RakN,#48] ; R7
FMAC Rc7,Ra7,Rbk,Rc7 ; R29
LDD Ra8,[RakN,#56] ; R7
FMAC Rc8,Ra8,Rbk,Rc8 ; R30
LOOP1 LE,RkN,#8,#64 ; R4
---------------end vectorize-------------------
ADD RkN,RkN,#8 ; R5
CMP Rt,RkN,#64 ; R6
BLE Rt,loop1
STD Rc1,[Rca,#0]
STD Rc2,[Rca,#8]
STD Rc3,[Rca,#16]
STD Rc4,[Rca,#24]
STD Rc5,[Rca,#32]
STD Rc6,[Rca,#40]
STD Rc7,[Rca,#48]
STD Rc8,[Rca,#56]
EXIT Rc1,Rc8,#0
RET
46 instructions; 19 instructions in the vectorized (unrolled) loop.
c[k] is read once and written once
b[k] is read 8×
a[k] is read 8×
If you are willing to have 64 FMACs in a row; a[k] can be read 2×
{with very tricky register allocation}.
Using this many registers causes 64 bytes to be written to stack
and read back later. Solving the a[k] traffic increases the stack
footprint to 104 bytes.
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
On 5/3/2026 3:28 PM, MitchAlsup wrote:
Thomas Koenig <[email protected]> posted:
MitchAlsup <[email protected]d> schrieb:
I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 66000, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.
First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now). After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).
Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.
So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU, thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.
So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount. It also eliminates any
"mucking around" with the OoO mechanism to handle not knowing which
registers are involved at instruction decode time that my previous
idea had.
As I said, I am sure there are issues with this. I welcome your comments.
Stephen Fuld <[email protected]d> posted:
On 5/3/2026 3:28 PM, MitchAlsup wrote:
Thomas Koenig <[email protected]> posted:
MitchAlsup <[email protected]d> schrieb:
I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 66000, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.
First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.
A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.
After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).
This is why a CTT is better, you can perform multiple instructions before returning.
Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.
So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,
A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.
thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.
Eliminates = 1.0-1.0/(loop_count)
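The formula above written out as a C function, mainly to show how it lines up with the 7/8 figure. The function name is made up for the example, and it assumes loop_count is at least 1.

```c
/* Fraction of instruction fetches avoided when a loop body is fetched
   once and then replayed for loop_count iterations in total.
   Assumption: loop_count >= 1 (zero would divide by zero). */
static double fetch_fraction_eliminated(unsigned loop_count)
{
    return 1.0 - 1.0 / (double)loop_count;
}
```

For a loop count of 8 this gives 0.875, i.e. the 7/8 or 87.5% mentioned in the quoted text; for a single iteration it gives 0, since nothing is reused.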
So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.
The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous
amounts. vVM only vectorizes the innermost loop.
On 5/25/2026 12:36 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 5/3/2026 3:28 PM, MitchAlsup wrote:
Thomas Koenig <[email protected]> posted:
MitchAlsup <[email protected]d> schrieb:
I have been thinking about this and have come up with another potential
solution. The usual caveats - I am not a hardware designer, and don't
know the guts of the My 66000, nor am I a numerical analyst, so this may
have some problems or even be totally unworkable. But if it works, by
allowing a sort of register indexing equivalent, I think it could make a
substantial reduction in the memory traffic. There are two parts to
this proposal.
First, is to enhance the JTT instruction, using the currently unused bit
combination of the BOB field to indicate, execute the instruction
pointed to by the displacement plus the contents of SRC1 (similarly to
how the values 00 and 11 do now.
A minor issue is the out-of-sequence Fetch problem this introduces.
Not un-workable, but an annoyance. Perhaps Call-through-Table would
work better--let me think on it.
I understand. My rationale for using a different BOB value was to permit
that functionality within a VVM loop, whereas the other values invoke a
jump or call, which you don't want to allow in VVM. In other words, it is
an indication that this is a small exception to the no control transfer
rule and should be allowed. But obviously you understand the internals
better than I do.
After execution of that instruction,
which must not be a control transfer instruction, control is returned to
the instruction after the TT instruction. This makes the TT instruction
behave similarly to an "Execute" instruction in some other
architectures. For our current example, the instructions at the target
address would be eight FMAC instructions, each using a different
register(s).
This is why a CTT is better, you can perform multiple instructions before returning.
I understand. If you want to go there within VVM fine. I was trying to avoid that.
Of course, I realize that this adds an "extra" instruction
execution, and that substituting an I-cache read (the executed
instruction) in place of a load doesn't seem like a savings. I can't do
anything about extra instruction execution, but see below.
So second, I think there is an enhancement that would eliminate most of
the instruction fetches of the executed instructions. As I understand
it, in VVM, when they are first encountered, the instructions between
the VEC and the Loop instructions are fetched and stored in a special
memory within the CPU,
A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.
thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.
Eliminates = 1.0-1.0/(loop_count)
??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.
So overall, I think this idea reduces the memory traffic cost of keeping
the A matrix in registers by a huge amount.
The issue is that there can be no-transfers-of-control* out of a vVM
loop while remaining IN a vVM loop. {This simplified HW by enormous amounts.}
vVM only vectorizes the innermost loop.
I agree. This would have to be an exception, which is why I thought a unique value for ROP could indicate that.
Stephen Fuld <[email protected]d> posted:
On 5/25/2026 12:36 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 5/3/2026 3:28 PM, MitchAlsup wrote:
A very minor change to Reservation Station logic, where static operands
can be used multiple times, and where the instruction remains present
after being fired until Loop is satisfied. Each RS operand contains an
index field that matches on the Loop iteration index.
thus allowing multiple iterations of the loop
without multiple I-cache accesses. So the idea is, once the Loop
instruction has been encountered, you know how many of the executed
instructions can fit in the remaining space (in this case, I think all
of them), and where to start them within this memory, (right after the
loop instruction). So further iterations of the loop can execute the
target instructions without any I-cache references required. I think in
this case it eliminates 7/8, i.e. 87.5% of them.
Eliminates = 1.0-1.0/(loop_count)
??? I thought you would fetch the "executed" instruction once and have
it internally for the next seven loop iterations. But I may be wrong.
Once the loop is in the reservation stations, they stay there for the
entire execution of the loop, while FETCH remains quiescent.
Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to by the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.
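As a rough check on the arithmetic above, here is a minimal Python sketch. The function names are illustrative only and do not come from the thread. It assumes each target instruction is fetched once and then reused from the reservation stations, and it shows that the (64-8)/64 figure and the `1.0-1.0/(loop_count)` formula agree when there are 8 loop iterations.

```python
from fractions import Fraction


def eliminated_fraction(executions: int, fetches: int) -> Fraction:
    """Fraction of I-cache fetches avoided when `executions` instruction
    executions need only `fetches` actual fetches (hypothetical helper)."""
    if executions <= 0 or not 0 <= fetches <= executions:
        raise ValueError("need 0 <= fetches <= executions and executions > 0")
    return Fraction(executions - fetches, executions)


def eliminated_by_loop_count(loop_count: int) -> Fraction:
    """The same quantity written as 1 - 1/loop_count: each instruction is
    fetched on the first iteration and reused on the remaining ones."""
    if loop_count <= 0:
        raise ValueError("loop_count must be positive")
    return 1 - Fraction(1, loop_count)


# The MATMUL (8) figures from the discussion: 8 fetches for 64 executions.
by_counts = eliminated_fraction(executions=64, fetches=8)
by_formula = eliminated_by_loop_count(loop_count=8)
print(by_counts, by_formula, float(by_counts))
```

Both expressions give 7/8 here because 64 executions divided by 8 fetches is exactly the loop count of 8.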
Stephen Fuld [2026-05-26 11:55:11] wrote:
Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to by the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.
Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out (i.e. is executed) depends on the value in register Rn?
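To make the "predication-style `case`" reading concrete, here is a small behavioural model in Python. It is only a sketch under stated assumptions: `NATPRED` is the hypothetical prefix named above, the selector in `Rn` is assumed to be zero-based, and the register names (`a0`..`a7`, `b`, `acc`, `idx`) and the FMAC-like operation are invented for illustration. It says nothing about how hardware would implement the selection.

```python
from typing import Callable, Dict, List

Regs = Dict[str, float]
Instr = Callable[[Regs], None]


def make_fmac(src: str) -> Instr:
    """Build an FMAC-like instruction: acc += regs[src] * regs['b'].
    Each instance names a different source register."""
    def fmac(regs: Regs) -> None:
        regs["acc"] += regs[src] * regs["b"]
    return fmac


def natpred(regs: Regs, rn: str, shadow: List[Instr]) -> int:
    """Model of the hypothetical `NATPRED Rn, M` prefix: of the M shadowed
    instructions, only the one whose position equals the value in Rn runs;
    the rest are predicated out. Assumes a zero-based selector."""
    selector = int(regs[rn])
    if not 0 <= selector < len(shadow):
        raise IndexError("selector outside the shadowed instruction range")
    for position, instr in enumerate(shadow):
        if position == selector:  # every other position is predicated out
            instr(regs)
    return selector


# Eight shadowed FMACs, each reading a different "A" register.
shadow = [make_fmac(f"a{k}") for k in range(8)]
regs: Regs = {f"a{k}": float(k + 1) for k in range(8)}
regs.update({"acc": 0.0, "b": 2.0, "idx": 3})

executed = natpred(regs, "idx", shadow)
print(executed, regs["acc"])
```

With `idx` holding 3, only the FMAC that reads `a3` runs, which is the effect of indexing the register file with a run-time value.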
On 5/27/2026 7:57 AM, Stefan Monnier wrote:
Stephen Fuld [2026-05-26 11:55:11] wrote:
Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to by the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.
Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?
Yes, with the exception that they don't have to immediately follow what
you call the NATPRED instruction (which is actually a modified TT
instruction). But once the instructions are loaded into the
reservation stations, the result is exactly as you describe. And, of
course, you don't have to "skip over" the instructions not executed,
you just use the M value to choose which of the instructions to execute.
On 5/27/2026 11:13 AM, Stephen Fuld wrote:
On 5/27/2026 7:57 AM, Stefan Monnier wrote:
Stephen Fuld [2026-05-26 11:55:11] wrote:
Yes. That is what I thought. This would be an exception to that in
that you would have to fetch and put into the reservation stations,
the eight instructions pointed to by the TT instruction. It is those
eight fetches out of the 64 executions of those instructions that led
me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
otherwise be required.
Oh, I think I understand your proposal. You want a kind of predication
but instead of being a "predication-style `if`" it's a "predication-style
`case`", i.e. based on a numeric rather than a boolean value.
E.g. you'd have a `NATPRED Rn, M` prefix instruction which
would "shadow" the next M instructions such that all but one of the
M instructions are "predicated out", and which one is not predicated-out
(i.e. is executed) depends on the value in register Rn?
Yes, with the exception that they don't have to immediately follow
what you call the NATPRED instruction (which is actually a modified TT
instruction). But once the instructions are loaded into the
reservation stations, the result is exactly as you describe. And, of
course, you don't have to "skip over" the instructions not executed,
you just use the M value to choose which of the instructions to execute.
Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
a VVM loop rule.
And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.
Thank you Stefan for helping me see this.
So now the question is, does this save enough to be worth implementing?
I don't know enough to write prototype code for say MATMUL (8) to see
how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?
Stephen Fuld <[email protected]d> posted:
Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the
"jump" characteristic which required an exception to the no jump within
a VVM loop rule.
The PRED instruction (when used in a vVM loop) produces a lane mask so
that different iterations of the loop are executed from the same
starting time.
And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.
That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.
Yes, with the exception that they don't have to immediately follow what you
call the NATPRED instruction (which is actually a modified TT instruction).
But once the instructions are loaded into the reservation stations, the
result is exactly as you describe. And, of course, you don't have to "skip
over" the instructions not executed, you just use the M value to choose
which of the instructions to execute.
In vVM, all the instructions that make up the loop end up placed in
the "dataflow" core once and for all, after which they are just triggered
several times until the loop exit condition is satisfied.
So "skip over" makes no sense in this context (also because you want
a much shorter delay between "M is known" and "the corresponding
instruction is executed", so you have to decode the instruction(s)
before M is known).
But instead of skipping, you can "predicate away" the undesirable instructions.
So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).
=== Stefan
On 5/28/2026 11:08 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
Sorry for the self followup, but I now believe that it would be better,
both clearer to understand and easier to implement, to follow your
understanding and to require the "predicated" instructions to
immediately follow the "NATPRED" in the code. This allows the
functionality to make use of the already existing mechanism to get
predicated instructions into the reservation stations and eliminates the
"jump" characteristic which required an exception to the no jump within
a VVM loop rule.
The PRED instruction (when used in a vVM loop) produces a lane mask so
that different iterations of the loop are executed from the same
starting time.
And since the NATPRED instruction as you defined it, and I agree, only
has two operands, perhaps a reasonable exception would be to allow a
third field that changes the sense of the test to allow things like
execute all instructions up to N, all instructions except N, etc. I
don't know how useful this sort of thing would be.
That is what the then-clause and else-clause 'numbers' do--they set
up the vertical lane-mask for that iteration. And each lane calculates
its own lane mask.
Yes. But the proposed enhancement to my original proposal gives you
another way to specify which instructions get executed. Your original
PRED has a fixed mask to choose which instructions are executed or not, based on a binary condition test. This enhancement allows choosing
which instruction gets executed in each lane based on the value in a register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
"indexed register accesses" at low cost (I believe one extra cycle.)
My original proposal allows you to execute one instruction based on the value in a register, i.e. if the register contains the value 3, then the third instruction is the only one executed. The enhanced version allows more flexibility. For example, you could allow to specify execute all instructions up to the number in the register. As I said, while the use case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.
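The per-lane selection and the optional "sense" field described above can be modelled as mask generation. The following Python sketch is illustrative only: the `Sense` names, the zero-based selector, and the inclusive reading of "up to N" are all assumptions, since the thread does not pin them down.

```python
from enum import Enum
from typing import List


class Sense(Enum):
    """Hypothetical encodings for the proposed third field."""
    ONLY = "only"              # execute just the selected instruction
    UP_TO = "up_to"            # execute instructions 0..selector (inclusive, assumed)
    ALL_EXCEPT = "all_except"  # execute every instruction but the selected one


def shadow_mask(selector: int, m: int, sense: Sense = Sense.ONLY) -> List[bool]:
    """Return one enable bit per shadowed instruction for a single lane."""
    if not 0 <= selector < m:
        raise ValueError("selector must be in range(m)")
    if sense is Sense.ONLY:
        return [i == selector for i in range(m)]
    if sense is Sense.UP_TO:
        return [i <= selector for i in range(m)]
    return [i != selector for i in range(m)]


def lane_masks(selectors: List[int], m: int,
               sense: Sense = Sense.ONLY) -> List[List[bool]]:
    """Each lane (loop iteration) computes its own mask from its own
    register value, so different lanes can enable different instructions."""
    return [shadow_mask(s, m, sense) for s in selectors]


# One lane selects instruction 3, the next selects instruction 5.
masks = lane_masks([3, 5], m=8)
for lane, mask in enumerate(masks):
    print(lane, [int(bit) for bit in mask])
```

In the `ONLY` sense each lane enables exactly one of the M instructions, which is the register-indexing use case. The other two senses show what the extra field could express, without implying they are useful.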
On 5/28/2026 4:29 PM, Stefan Monnier wrote:
So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).
Yes, size is clearly a limitation.
I think it works for MATMUL (8), but probably not for MATMUL (16).
it would be to allow more than the current 32 instructions within a VVM loop.
As for performance, I expect (hope) that instructions other than
the one whose position matches the register value will not be executed.
If that can be done, then the extra cost is presumably one cycle, and it saves (in the MATMUL (8) example), executing eight load instructions.
Stephen Fuld <[email protected]d> posted:
My original proposal allows you to execute one instruction based on the
value in a register, i.e. if the register contains the value 3, then the
third instruction is the only one executed. The enhanced version allows
more flexibility. For example, you could allow to specify execute all
instructions up to the number in the register. As I said, while the use
case for the basic instruction is clear, emulating register indexing, I
am not sure there are any use cases for the enhancement.
Not sure what the source code would look like in order for the compiler
to recognize this pattern and optimize to your solution.
Stefan Monnier <[email protected]> posted:
But instead of skipping, You can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).
If the predicated instructions are used in at least 1 iteration, they
are not useless.
MitchAlsup [2026-05-29 16:55:38] wrote:
Stefan Monnier <[email protected]> posted:
But instead of skipping, You can "predicate away" the undesirable
instructions. So, in sum, I think what you describe can be made to
work. The main problem is that it will "fill" your dataflow core with
many "useless" instructions, so it risks making the whole loop too large
for vVM and it risks also making it inefficient (in case all
N instructions end up speculatively executed and the predication
operates by throwing away N-1 of the values).
If the predicated instructions are used in at least 1 iteration, they
are not useless.
They may not be useless overall, but they still waste resources at each
iteration where they're not used. Traditional predication of an `if`
gives a "50% waste" (for equal size branches or when each branch is
taken as often as the other), whereas a predicated `switch` results in
a waste of `(N-1)/N`. As N grows larger this becomes discouraging.
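To put numbers on that, a minimal Python sketch follows. The function name is illustrative, and it assumes the worst case described above, in which all N shadowed instructions occupy or consume resources on every iteration and only one result is kept.

```python
from fractions import Fraction


def predication_waste(n: int) -> Fraction:
    """Fraction of shadowed instruction slots wasted per iteration when
    exactly one of n predicated alternatives is actually used."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Fraction(n - 1, n)


# n = 2 is the two-way `if`; 8 and 16 match the MATMUL sizes discussed.
for n in (2, 8, 16):
    waste = predication_waste(n)
    print(n, waste, f"{float(waste):.1%}")
```

For N = 2 this is the 50% of an ordinary two-way `if`; for N = 8 it is 7/8, and for N = 16 it is 15/16.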