Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
The dual-operation instructions are replaced with post-ops. To use a
third or a fourth operand, a postfix instruction is needed.
The primary use of the post-op postfix is to supply an additional
register for instructions like FMA or bitfield operations but it can
also provide a second operation.
The post-op is performed between the result of the first two source
operands and a third operand supplied by the post-op postfix. The trick
is that the post-op is treated as part of the first instruction by the
CPU. Both the original op and the post-op are performed by the ALU at
the same time. So, post-ops are almost fused instructions.
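The fusion idea can be illustrated with a small behavioural sketch in Python. This is not the actual Q+ encoding or decoder logic, just a model of the described behaviour: a postfix word supplies a third operand and a second operation, and decode absorbs it into the preceding instruction so both execute as one micro-op.

```python
# Illustrative model of post-op fusion (not the actual Q+ encoding):
# a base instruction names two sources; an optional postfix word adds a
# third source and a second operation, and the decoder fuses both words
# into a single micro-op.

def decode(stream):
    """Turn (op, a, b) words plus optional ('POST', op2, c) words
    into fused micro-ops."""
    fused = []
    i = 0
    while i < len(stream):
        word = stream[i]
        if i + 1 < len(stream) and stream[i + 1][0] == 'POST':
            _, op2, c = stream[i + 1]
            fused.append((word[0], word[1], word[2], op2, c))
            i += 2          # postfix consumed along with its base instruction
        else:
            fused.append(word)
            i += 1
    return fused

def execute(uop):
    ops = {'ADD': lambda x, y: x + y, 'MUL': lambda x, y: x * y}
    if len(uop) == 5:       # fused: (op, a, b, op2, c)
        op, a, b, op2, c = uop
        return ops[op2](ops[op](a, b), c)
    op, a, b = uop
    return ops[op](a, b)

# Two instruction words fuse into one micro-op; the third stands alone.
uops = decode([('MUL', 3, 4), ('POST', 'ADD', 5), ('ADD', 1, 2)])
```

The dynamic instruction count is unchanged, matching the point above: the postfix is a code-density mechanism, not an extra executed operation.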
Dual operand instructions were used about 0.2% of the time in a small
sample of compiled code. IDK if they are worth it or not. I set my
cutoff for usefulness at 0.1%.
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
which still leaves you 14 bits for opcode space. I would have a lot
of trouble filling that opcode space :-)
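Spelling out the bit-budget arithmetic in the paragraph above as a trivial check:

```python
# 40-bit word, 64 registers -> 6 bits per register specifier.
word = 40
reg_fields = 4 * 6                    # Rd, Ra, Rb, Rc
opcode_4r = word - reg_fields         # bits left for the opcode
opcode_with_signs = opcode_4r - 2     # two sign bits for +/-Ra*Rb +/-Rc
```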
But FMA is used a lot, this should be one instruction really.
The
dual-operation instructions are replaced with post-ops. To use a third
or a fourth operand a postfix instruction is needed.
Four operands?
[...]
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
which still leaves you 14 bits for opcode space. I would have a lot
of trouble filling that opcode space :-)
But FMA is used a lot, this should be one instruction really.
IMHO:
Can do the basic case as a single-width 3R instruction.
Rd = Rd + Rs * Rt
Then, for the other cases, and for 4R, switch over to a longer encoding.
...
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
This is comparable to the space XG3 burns on Jumbo Prefixes.
IMO, Jumbo Prefixes are encoding space better spent than 4R FMAC.
Well, and FMAC is one and done.
The scope of Jumbo Prefixes can expand gracefully.
Like (in XG3), I can add immediate forms of the 128-bit ALU instructions:
ADDX Rm, Imm33s, Rn
ANDX Rm, Imm33s, Rn
And, also Imm64 forms:
ADDX Rm, Imm64, Rn
ANDX Rm, Imm64, Rn
ISA level changes needed? Basically none.
I mostly just needed to tweak some decoding logic such that the high
half of the ALU gets fed a sign extension of the low half (rather than a
mirrored copy). Otherwise, it is immediate synthesis on an existing
instruction.
No Imm128 forms ATM though.
Something like:
__int128 a, b, c;
...
c = a + 0x123456789ABCDEF123456789ABCDEFUI128;
Well, this is gonna require at least 3 instructions...
But, could at least in theory be single instruction if the immediate
fits within 64 bits.
Currently no good/obvious way to map this sort of stuff over to RISC-V
land though.
It is possible that I could define an "AWX" or similar special case via
the J21O scheme, could maybe, at least, get, say:
ADDX Fd, Fs, Ft
ADDX Fd, Fs, Imm17s
And, with an ADDX and LI_Imm33 and SHORI32_Imm32, could in theory get
the previous scenario down to 5 instructions:
LI F2, Imm63_32
LI F3, Imm127_96
SHORI F2, Imm31_0
SHORI F3, Imm95_64
ADDX F12, F10, F2
...
Equivalent operation would take 28 bytes to encode in XG3, and 40 bytes
in RV64+Jx. Or, maybe get it down to 32 bytes with the J52I prefixes.
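The LI/SHORI constant-synthesis sequence above can be modelled in a few lines of Python. The semantics assumed here (LI loads a 32-bit chunk, SHORI shifts the register left 32 and ORs in another 32 bits, and the 128-bit value lives in a low/high register pair) are inferred from the post, not a definitive XG3 description.

```python
# Sketch of building a 128-bit constant in a register pair (F2 = low
# 64 bits, F3 = high 64 bits) from four 32-bit chunks, mirroring:
#   LI F2, Imm63_32 ; LI F3, Imm127_96
#   SHORI F2, Imm31_0 ; SHORI F3, Imm95_64

MASK64 = (1 << 64) - 1

def li(imm32):
    """Load a 32-bit immediate (assumed semantics)."""
    return imm32 & 0xFFFFFFFF

def shori(reg, imm32):
    """Shift left 32 and OR in a 32-bit immediate (assumed semantics)."""
    return ((reg << 32) | (imm32 & 0xFFFFFFFF)) & MASK64

value = 0x123456789ABCDEF123456789ABCDEF   # the constant from the post
lo64, hi64 = value & MASK64, value >> 64

f2 = li(lo64 >> 32)                  # LI F2, Imm63_32
f3 = li(hi64 >> 32)                  # LI F3, Imm127_96
f2 = shori(f2, lo64 & 0xFFFFFFFF)    # SHORI F2, Imm31_0
f3 = shori(f3, hi64 & 0xFFFFFFFF)    # SHORI F3, Imm95_64
rebuilt = (f3 << 64) | f2            # the pair now holds the constant
```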
Well, and if jumbo prefixes are a hard sell for RV land, 128-bit ALU ops
are likely to be worse.
...
The
dual-operation instructions are replaced with post-ops. To use a third
or a fourth operand a postfix instruction is needed.
Four operands?
[...]
On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
There are about 34 bits. The FMA is part of a group of float-ops
specified by a six bit function code. The primary opcode is also seven
(5+2) bits. However, two bits of the primary opcode are used to specify
the precision. Two bits are used to specify a vector mask. Three bits
are used to specify a rounding mode. And one bit to record the result status.
The opcode looks like:
SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP
S=select exception status reg. 0 or 1
F=function code, identifies FMA and others
M=select constant for 1 or 2
r=record result status in ccr1
R=rounding mode
V=vector mask register to use
2=2nd source operand
1=1st source operand
D=destination operand
O=Primary Opcode
P=precision
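A quick consistency check on the layout string above, just counting each field's width and confirming the fields pack into 40 bits:

```python
# Count the width of each field letter in the layout string.
layout = "SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP"

widths = {}
for ch in layout:
    widths[ch] = widths.get(ch, 0) + 1
```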
It is not impossible to define FMA at the root level, using the function code bits for a register instead.
Note the use of the postfix is only a code density issue. It is absorbed
by the CPU and the FMA and postfix execute as one instruction so the
dynamic instruction count is not any different.
Internally, the micro-ops have room for three source operands.
I guess it is a matter of how often is FMA used. There are also FADD,
FSUB, FMUL instructions defined that are mapped to FMA internally which
do not require a postfix.
Q+ version 4 uses a 48-bit instruction that has room for three source registers, but most of the time the third register is not needed.
which still leaves you 14 bits for opcode space. I would have a lot
of trouble filling that opcode space :-)
But FMA is used a lot, this should be one instruction really.
It operates like one 80-bit instruction with lots of unused bits.
The
dual-operation instructions are replaced with post-ops. To use a third
or a fourth operand a postfix instruction is needed.
Four operands?
Potentially allows a fused dot-product. It may help with some other instructions, perhaps reduction operations.
[...]
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
On 2026-04-20 3:50 a.m., BGB wrote:
IMHO:
Can do the basic case as a single-width 3R instruction.
Rd = Rd + Rs * Rt
I am not fond of destructive ops. They sometimes use up an extra
register and an extra instruction.
Then, for the other cases, and for 4R, switch over to a longer encoding.
...
Turns out there is not enough room at the root level to add FMA at four different precisions with sign control (16 opcodes).
I cannot see there being a lot of use for 128-bit immediates, except for when one wants to use an immediate for a SIMD operation in which case
all the bits are needed.
BGB <[email protected]> posted:
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
My 66000 has a subGroup of 8 instructions that cover all uses of
3-operand 1-result (4-register) Major = {010xxx} So,
Major 6-bits
Result 5-bits
Source 15-bits
--------------
26-bits
leaving 6-bits which I then use
2-bits for size {precision}
4-bits Signs/Constant substitution
Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS,
CMOV, LOOP; with 3 left unassigned.
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Constant insertion gives me:
FMAC Rd= Im5+R2*R3
FMAC Rd= Rd+Im5*R3
FMAC Rd= F32+R2*R3
FMAC Rd= R1+F32*R3
FMAC Rd= F64+R2*R3
FMAC Rd= R1+F64*R3
From 1 instruction.
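Reference semantics for the sign-control cases listed above, as a behavioural sketch (the forms are from the post; the encoding itself is not modelled, and constant insertion is omitted):

```python
# Rd = (+/-)R1 + (+/-)(R2*R3), the four sign-controlled FMAC forms.
def fmac(r1, r2, r3, neg_addend=False, neg_product=False):
    addend = -r1 if neg_addend else r1
    product = r2 * r3
    return addend + (-product if neg_product else product)
```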
On 4/20/2026 11:36 AM, MitchAlsup wrote:
BGB <[email protected]> posted:
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
My 66000 has a subGroup of 8 instructions that cover all uses of
3-operand 1-result (4-register) Major = {010xxx} So,
Major 6-bits
Result 5-bits
Source 15-bits
--------------
26-bits
leaving 6-bits which I then use
2-bits for size {precision}
4-bits Signs/Constant substitution
Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS,
CMOV, LOOP; with 3 left unassigned.
I am far from an applied mathematician, so these may be silly, but . . .
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Constant insertion gives me:
FMAC Rd= Im5+R2*R3
FMAC Rd= Rd+Im5*R3
FMAC Rd= F32+R2*R3
FMAC Rd= R1+F32*R3
FMAC Rd= F64+R2*R3
FMAC Rd= R1+F64*R3
Similarly to my above question, is there no need for an immediate for R3?
On 4/20/2026 11:36 AM, MitchAlsup wrote:
BGB <[email protected]> posted:
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
My 66000 has a subGroup of 8 instructions that cover all uses of
3-operand 1-result (4-register) Major = {010xxx} So,
Major 6-bits
Result 5-bits
Source 15-bits
--------------
26-bits
leaving 6-bits which I then use
2-bits for size {precision}
4-bits Signs/Constant substitution
Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS, CMOV, LOOP; with 3 left unassigned.
I am far from an applied mathematician, so these may be silly, but . . .
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Constant insertion gives me:
FMAC Rd= Im5+R2*R3
FMAC Rd= Rd+Im5*R3
FMAC Rd= F32+R2*R3
FMAC Rd= R1+F32*R3
FMAC Rd= F64+R2*R3
FMAC Rd= R1+F64*R3
Similarly to my above question, is there no need for an immediate for R3?
Note that if you have an immediate value of zero for one of the
addends, you have an FP multiply, so could use that to eliminate an op
code (probably special case the hardware for speed). Similarly, if you
had an immediate for R3 and had a value of one for it, you have defined
an FP add, so could eliminate another op code.
From 1 instruction.
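The specialisation argument above can be sketched in plain Python floats (ignoring signed-zero and rounding subtleties that a real FPU would have to consider):

```python
# With Rd = R1 + R2*R3, a zero addend degenerates to FP multiply,
# and a unit multiplier for R3 degenerates to FP add.
def fmac(r1, r2, r3):
    return r1 + r2 * r3

def fmul(r2, r3):
    return fmac(0.0, r2, r3)   # immediate 0 for the addend

def fadd(r1, r2):
    return fmac(r1, r2, 1.0)   # immediate 1 for R3
```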
Stephen Fuld <[email protected]d> schrieb:
On 4/20/2026 11:36 AM, MitchAlsup wrote:
BGB <[email protected]> posted:
On 4/19/2026 3:16 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Trying to do 4R in a single instruction word is ineffective use of
encoding space.
RISC-V falls into this trap in a few cases:
FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.
My 66000 has a subGroup of 8 instructions that cover all uses of
3-operand 1-result (4-register) Major = {010xxx} So,
Major 6-bits
Result 5-bits
Source 15-bits
--------------
26-bits
leaving 6-bits which I then use
2-bits for size {precision}
4-bits Signs/Constant substitution
Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS,
CMOV, LOOP; with 3 left unassigned.
I am far from an applied mathematician, so these may be silly, but . . .
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Not sure what for - sign control on the product should be
enough :-)
Constant insertion gives me:
FMAC Rd= Im5+R2*R3
FMAC Rd= Rd+Im5*R3
FMAC Rd= F32+R2*R3
FMAC Rd= R1+F32*R3
FMAC Rd= F64+R2*R3
FMAC Rd= R1+F64*R3
Similarly to my above question, is there no need for an immediate for R3?
R2 and R3 are interchangeable.
Robert Finch <[email protected]> schrieb:
On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
There are about 34 bits. The FMA is part of a group of float-ops
specified by a six bit function code. The primary opcode is also seven
(5+2) bits. However, two bits of the primary opcode are used to specify
the precision. Two bits are used to specify a vector mask. Three bits
are used to specify a rounding mode. And one bit to record the result
status.
Is the rounding mode really needed in every instruction? You would
need a dynamic rounding mode anyway, and this could save you
three bits. There will likely not be many instructions with four
registers, so you could use fewer bits for that particular opcode group.
Having two bits of your primary opcode space always reserved for
precision also seems a lot; there should be operations where this
is not needed.
[...]
On 4/20/2026 1:55 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
There are about 34 bits. The FMA is part of a group of float-ops
specified by a six bit function code. The primary opcode is also seven
(5+2) bits. However, two bits of the primary opcode are used to specify
the precision. Two bits are used to specify a vector mask. Three bits
are used to specify a rounding mode. And one bit to record the result
status.
Is the rounding mode really needed in every instruction? You would
need a dynamic rounding mode anyway, and this could save you
three bits. There will likely not be many instructions with four
registers, so you could use fewer bits for that particular opcode group.
Having two bits of your primary opcode space always reserved for
precision also seems a lot; there should be operations where this
is not needed.
IMHO: No.
For fixed rounding modes, in my case they are either:
RNE, Default, so nothing special needed;
DYN, Also instructions exists for this.
DYN fetches the RM from FPSCR;
Others: Jumbo Prefix.
Likewise for 4R (non-destructive FMAC).
So, single 3R op as basic case:
FMAC Rm, Ro, Rn
Where:
Rn=Rn+Rm*Ro
But, then with the Jumbo Prefix one can get:
* FMAC: Rn=Rp+Rm*Ro
* FMAS: Rn=Rm*Ro-Rp
* FMRS: Rn=Rp-Rm*Ro
* FMRA: Rn=-(Rp+Rm*Ro)
With some bits to also specify things like rounding mode and SIMD
variants if desired.
There is a big tradeoff though where there are a few orders of magnitude
of performance difference based on whether it needs to be single-rounded (IOW: Don't try to do "Double-Double" with this thing).
Seemingly failed to mention this earlier, getting more distracted with thinking about Int128 it seems...
...
[...]
On 2026-04-20 7:37 p.m., BGB wrote:
On 4/20/2026 1:55 PM, Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
Robert Finch <[email protected]> schrieb:
Working on Q+ version 5 now. Version 5 is only going to support two
source operands per instruction instead of three to decrease the size of
instructions. 40-bit instructions will save 17% on the code space while
being better than 95% as effective as the 48-bit instructions.
You have 40 bit instructions and 64 registers. Six bits per register
leaves 16 bits for opcodes for four-register operations. Add two
sign bits so you can have
Rd = +/- Ra * Rb +/- Rc
There are about 34 bits. The FMA is part of a group of float-ops
specified by a six bit function code. The primary opcode is also seven
(5+2) bits. However, two bits of the primary opcode are used to specify
the precision. Two bits are used to specify a vector mask. Three bits
are used to specify a rounding mode. And one bit to record the result
status.
Is the rounding mode really needed in every instruction? You would
need a dynamic rounding mode anyway, and this could save you
three bits. There will likely not be many instructions with four
registers, so you could use fewer bits for that particular opcode group.
Having two bits of your primary opcode space always reserved for
precision also seems a lot; there should be operations where this
is not needed.
IMHO: No.
For fixed rounding modes, in my case they are either:
RNE, Default, so nothing special needed;
DYN, Also instructions exists for this.
DYN fetches the RM from FPSCR;
Others: Jumbo Prefix.
Likewise for 4R (non-destructive FMAC).
So, single 3R op as basic case:
FMAC Rm, Ro, Rn
Where:
Rn=Rn+Rm*Ro
But, then with the Jumbo Prefix one can get:
* FMAC: Rn=Rp+Rm*Ro
* FMAS: Rn=Rm*Ro-Rp
* FMRS: Rn=Rp-Rm*Ro
* FMRA: Rn=-(Rp+Rm*Ro)
With some bits to also specify things like rounding mode and SIMD
variants if desired.
There is a big tradeoff though where there are a few orders of
magnitude of performance difference based on whether it needs to be
single-rounded (IOW: Don't try to do "Double-Double" with this thing).
Seemingly failed to mention this earlier, getting more distracted with
thinking about Int128 it seems...
...
[...]
The function code is not present in all instructions. Loads and stores,
immediate operates, and other miscellaneous instructions do not have the
field; they go by the seven bit primary opcode.
I have found the breakdown of opcode and function code to be very
packed; most of the codes and combinations of codes are used. I do not
think there is much entropy wasted. Is there a tool that can estimate
entropy? Something that can scan a binary file and rate the entropy?
I found an entropy measurer on the web and built it. Entropy for Q+4 was
only 0.21 out of 8 for the boot file.
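For reference, a minimal byte-level Shannon entropy measurer of the kind described (one possible sketch of such a tool, not the one actually used) is only a few lines:

```python
# Scan a byte string and report its Shannon entropy in bits per byte,
# on a scale of 0 (constant data) to 8 (uniformly random bytes).
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Usage on a binary (hypothetical filename):
#   byte_entropy(open("boot.bin", "rb").read())
```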
An issue may be the support for instructions that are used for only
specific types of programs. There are lots of instructions used for half
and single precision, but if a program is only using double precision
then these are wasted.
Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.
There are six combinations for rounding modes including DYN rounding. I found a potential use for another round code (statistical or random
rounding fed from an entropy source), so I am hesitant to reduce these.
Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
bits are not present the bits are used to extend the register selection
to the full 128 registers.
Instructions with a rounding mode are limited to 64 registers.
With 80-bit instructions supplying two more registers, a fused dot-
product can be done.
Rd = (Rs1*Rs2)+(Rs3*Rs4)
It takes a lot of register ports, though, so I am not sure about trying
to implement it. The FDP is only about 6500 LCs. Five read ports plus a
port for the vector mask register plus a port for the rounding mode.
Stephen Fuld <[email protected]d> posted:
On 4/20/2026 11:36 AM, MitchAlsup wrote:
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Since * is commutative, sign control over * can be applied
to either R2 or R3
Robert Finch <[email protected]> schrieb:
The function code is not present in all instructions. Loads and stores
immediate operates and other miscellaneous instructions do not have the
field. They go by the seven bit primary opcode.
I have found the breakdown of opcode and function code to be very
packed, most of the codes and combinations of codes are used. I do not
think there is much entropy wasted. Is there a tool that can estimate
entropy? Something that can scan a binary file and rate the entropy?
Mitch has a very efficient packing of bits in his ISA, which has
32-bit instructions. It would be possible in theory (not suggesting
that you should do it :-) to take his encodings, make all register
specifiers 7 bit to accommodate your 128 registers (which would
give you 40 instead of 32 bits for four-register instructions)
and then wonder what to do with the holes left by the instructions
with fewer than four registers.
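The widening arithmetic above, spelled out as a check: going from 5-bit to 7-bit register specifiers in a four-register instruction adds 2 bits per specifier to a 32-bit word.

```python
base_word = 32
extra = 4 * (7 - 5)          # four specifiers, +2 bits each
widened = base_word + extra  # the 40-bit word mentioned above
```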
An issue may be the support for instructions that are used for only
specific types of programs. There are lots of instructions used for half
and single precision, but if a program is only using double precision
then these are wasted.
Momentarily thinking of a dynamically changing ISA based on the program
class. It could be controlled by a register in the CPU.
I don't think this is a good idea. I assume you would want to
reuse the opcode space (let's call the versions then A and B).
What if you use ISA A, and the program dynamically loads a library
using ISA B? Or is this something that each function would
have to set dynamically?
MitchAlsup <[email protected]d> writes:
Stephen Fuld <[email protected]d> posted:
On 4/20/2026 11:36 AM, MitchAlsup wrote:
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Since * is commutative, sign control over * can be applied
to either R2 or R3
Actually, that's only true because
1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b
2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b
I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
or -0.
But in most code the difference is not important, so few programmers
write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).
In the few cases where the difference (if it exists) is important,
(-a)*(-b) can be encoded as negation followed by FMAC.
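The zero cases can be checked exhaustively for the laws in question. In IEEE 754, negation just flips the sign bit and the sign of a product is the XOR of the operand signs, so the identities hold even for +/-0.0. A small Python check, comparing the sign of zero as well as the value:

```python
# Verify a*(-b) == -(a*b) == (-a)*b and (-a)*(-b) == a*b over a set of
# values including both signed zeros.
from math import copysign

def same_fp(x, y):
    """Equal including the sign of zero."""
    return x == y and copysign(1.0, x) == copysign(1.0, y)

checks = []
for a in (0.0, -0.0, 2.0, -2.0):
    for b in (0.0, -0.0, 3.0, -3.0):
        checks.append(same_fp(a * (-b), -(a * b)))
        checks.append(same_fp((-a) * b, -(a * b)))
        checks.append(same_fp((-a) * (-b), a * b))
```

(NaN is a separate story, since NaNs compare unequal to everything; the signed-zero cases Anton asks about are the ones exercised here.)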
Anton Ertl wrote:
MitchAlsup <[email protected]d> writes:
Stephen Fuld <[email protected]d> posted:
On 4/20/2026 11:36 AM, MitchAlsup wrote:
Sign Control gives me:
FMAC Rd= R1+R2*R3
FMAC Rd=-R1+R2*R3
FMAC Rd= R1-R2*R3
FMAC Rd=-R1-R2*R3
Is there no need for sign control of R3?
Since * is commutative, sign control over * can be applied
to either R2 or R3
Actually, that's only true because
1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b
2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b
I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
or -0.
They do, including the single interesting case of both a & b being +/-
zero:
+ * + -> +
+ * - -> -
etc
When just one operand is zero, the result is also zero, and the sign
follows the usual product rules:
positive a * -0.0 -> -0.0
But in most code the difference is not important, so few programmers
write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).
Should not be needed. We try hard to attain minimum surprise factor.
In the few cases where the difference (if it exists) is important,
(-a)*(-b) can be encoded as negation followed by FMAC.
Ditto.