• Q+ status / post-op instructions

    From Robert Finch@[email protected] to comp.arch on Sat Apr 18 21:24:52 2026
    From Newsgroup: comp.arch

    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions. The dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    The primary use of the post-op postfix is to supply an additional
    register for instructions like FMA or bitfield operations but it can
    also provide a second operation.

    The post-op is performed between the result of the first two source
    operands and a third operand supplied by the post-op postfix. The trick
    is that the post-op is treated as part of the first instruction by the
    CPU. Both the original op and the post-op are performed by the ALU at
    the same time. So, post-ops are almost fused instructions.
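    A minimal sketch of the idea in Python (my own toy model, not the actual
    Q+ decoder): the base op and the post-op fuse into one three-source
    micro-op at decode time, so both execute as a single instruction.

```python
# Toy model of post-op fusion: a two-source base instruction and an
# optional post-op postfix combine into one micro-op.

def fuse(base_op, ra, rb, post_op=None, rc=None):
    """Build one micro-op; with a postfix it carries three sources."""
    if post_op is None:
        return {"ops": (base_op,), "srcs": (ra, rb)}
    return {"ops": (base_op, post_op), "srcs": (ra, rb, rc)}

ALU = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

def execute(uop, regs):
    vals = [regs[r] for r in uop["srcs"]]
    result = ALU[uop["ops"][0]](vals[0], vals[1])
    if len(uop["ops"]) == 2:              # post-op: (a op b) post rc
        result = ALU[uop["ops"][1]](result, vals[2])
    return result

regs = {1: 2, 2: 3, 3: 10}
print(execute(fuse("mul", 1, 2, "add", 3), regs))   # 2*3 + 10 = 16
```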

    Dual-operation instructions were used about 0.2% of the time in a small
    sample of compiled code. IDK if they are worth it or not. I set my
    usefulness cutoff at 0.1%.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Apr 19 18:28:25 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions.

    How do you support FMAC ??

    40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions. The dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    The primary use of the post-op postfix is to supply an additional
    register for instructions like FMA or bitfield operations but it can
    also provide a second operation.

    Paying extra for THE workhorse FP calculation...

    The post-op is performed between the result of the first two source
    operands and a third operand supplied by the post-op postfix. The trick
    is that the post-op is treated as part of the first instruction by the
    CPU. Both the original op and the post-op are performed by the ALU at
    the same time. So, post-ops are almost fused instructions.

    Dual operand instructions were used about 0.2% of the time in a small
    sample of compiled code. IDK if they are worth it or not. I set my
    cutoff at 0.1% useful.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Apr 19 20:16:40 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)
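    The bit arithmetic above can be checked mechanically:

```python
# Bit budget for a 4-register operation in a 40-bit instruction word
# with 64 registers, as in the post above.
INSN_BITS = 40
REG_BITS = 6                  # 64 registers -> 6 bits each
used = 4 * REG_BITS + 2       # Rd, Ra, Rb, Rc plus two sign bits
opcode_bits = INSN_BITS - used
print(opcode_bits)            # 14 bits left for the opcode
```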

    But FMA is used a lot, this should be one instruction really.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Sun Apr 19 21:49:03 2026
    From Newsgroup: comp.arch

    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    The opcode looks like:
    SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP

    S=select exception status reg. 0 or 1
    F=function code, identifies FMA and others
    M=select constant for 1 or 2
    r=record result status in ccr1
    R=rounding mode
    V=vector mask register to use
    2=2nd source operand
    1=1st source operand
    D=destination operand
    O=Primary Opcode
    P=precision
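    A small field extractor for that layout (assuming the leftmost letter in
    the diagram is the most significant bit, which is my guess):

```python
# Sketch of a decoder for the 40-bit float-op layout quoted above.
FIELDS = [("S", 1), ("F", 6), ("M", 2), ("r", 1), ("R", 3), ("V", 2),
          ("src2", 6), ("src1", 6), ("D", 6), ("O", 5), ("P", 2)]
assert sum(w for _, w in FIELDS) == 40    # widths fill the 40-bit word

def decode(word40):
    out, shift = {}, 40
    for name, width in FIELDS:            # walk MSB -> LSB
        shift -= width
        out[name] = (word40 >> shift) & ((1 << width) - 1)
    return out
```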

    It is not impossible to define FMA at the root level, using the function
    code bits for a register instead.

    Note that the use of the postfix is only a code-density issue. It is
    absorbed by the CPU, and the FMA and postfix execute as one instruction,
    so the dynamic instruction count is no different.
    Internally, the micro-ops have room for three source operands.

    I guess it is a matter of how often FMA is used. There are also FADD,
    FSUB, FMUL instructions defined that are mapped to FMA internally which
    do not require a postfix.

    Q+ version 4 uses a 48-bit instruction that has room for three source registers, but most of the time the third register is not needed.

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.
    It operates like one 80-bit instruction with lots of unused bits.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    Potentially it allows a fused dot product. It may also help with some
    other instructions, perhaps reduction operations.
    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Apr 20 02:50:18 2026
    From Newsgroup: comp.arch

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.


    IMHO:
    Can do the basic case as a single-width 3R instruction.
    Rd = Rd + Rs * Rt

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...


    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    This is comparable to the space XG3 burns on Jumbo Prefixes.
    IMO, Jumbo Prefixes are encoding space better spent than 4R FMAC.



    Well, and FMAC is one and done.

    The scope of Jumbo Prefixes can expand gracefully.


    Like (in XG3), I can add immediate forms of the 128-bit ALU instructions:
    ADDX Rm, Imm33s, Rn
    ANDX Rm, Imm33s, Rn
    And, also Imm64 forms:
    ADDX Rm, Imm64, Rn
    ANDX Rm, Imm64, Rn

    ISA level changes needed? Basically none.

    I mostly just needed to tweak some decoding logic such that the
    high-half of the ALU gets fed a sign extension of the low half (rather
    than a mirrored copy). Otherwise, it is immediate synthesis on an
    existing instruction.
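    A rough arithmetic model of what that buys (my own sketch, not XG3's
    actual datapath): the low 64-bit ALU half adds the immediate, and the
    high half adds the immediate's sign extension plus the carry out of the
    low half.

```python
# 128-bit ADDX with a 33-bit signed immediate, computed as two 64-bit
# halves; the high half sees the sign extension of the immediate.
MASK64 = (1 << 64) - 1

def addx_imm33(a128, imm33s):
    assert -(1 << 32) <= imm33s < (1 << 32)   # fits in 33 signed bits
    imm128 = imm33s & ((1 << 128) - 1)        # sign-extend to 128 bits
    lo = (a128 & MASK64) + (imm128 & MASK64)
    hi = (a128 >> 64) + (imm128 >> 64) + (lo >> 64)  # carry into high half
    return ((hi & MASK64) << 64) | (lo & MASK64)

print(hex(addx_imm33(1 << 64, -1)))   # 0xffffffffffffffff
```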


    No Imm128 forms ATM though.

    Something like:
    __int128 a, b, c;
    ...
    c = a + 0x123456789ABCDEF123456789ABCDEFUI128;
    Well, this is gonna require at least 3 instructions...

    But, could at least in theory be single instruction if the immediate
    fits within 64 bits.


    Currently no good/obvious way to map this sort of stuff over to RISC-V
    land though.

    It is possible that I could define an "AWX" or similar special case via
    the J21O scheme, could maybe, at least, get, say:
    ADDX Fd, Fs, Ft
    ADDX Fd, Fs, Imm17s


    And, with an ADDX and LI_Imm33 and SHORI32_Imm32, could in theory get
    the previous scenario down to 5 instructions:
    LI F2, Imm63_32
    LI F3, Imm127_96
    SHORI F2, Imm31_0
    SHORI F3, Imm95_64
    ADDX F12, F10, F2
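    The reassembly can be sketched numerically; the LI/SHORI semantics here
    (load a 32-bit chunk; shift left 32 and OR in the next chunk) are my
    guesses from the post, for illustration only.

```python
# Two LI/SHORI pairs rebuild both 64-bit halves of a 128-bit constant.
M32, M64 = (1 << 32) - 1, (1 << 64) - 1

def li(imm32):
    return imm32 & M32

def shori(reg, imm32):
    return ((reg << 32) | (imm32 & M32)) & M64

imm128 = 0x123456789ABCDEF123456789ABCDEF      # constant from the post
c127_96, c95_64 = (imm128 >> 96) & M32, (imm128 >> 64) & M32
c63_32,  c31_0  = (imm128 >> 32) & M32, imm128 & M32

f3 = shori(li(c127_96), c95_64)    # high 64 bits
f2 = shori(li(c63_32),  c31_0)     # low 64 bits
assert (f3 << 64) | f2 == imm128   # the four chunks reassemble exactly
```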

    ...


    Equivalent operation would take 28 bytes to encode in XG3, and 40 bytes
    in RV64+Jx. Or, maybe get it down to 32 bytes with the J52I prefixes.

    Well, and if jumbo prefixes are a hard sell for RV land, 128-bit ALU ops
    are likely to be worse.


    ...


    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 06:21:07 2026
    From Newsgroup: comp.arch

    On 2026-04-20 3:50 a.m., BGB wrote:
    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leves 16 bit for opcodes for four-register operations.  Add two
    sign bits so you can have

       Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space.  I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.


    IMHO:
    Can do the basic case as a single-width 3R instruction.
      Rd = Rd + Rs * Rt

    I am not fond of destructive ops. They sometimes use up an extra
    register and an extra instruction.

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...

    It turns out there is not enough room at the root level to add FMA at four different precisions with sign control (16 opcodes). So, I just added
    FMA (no sign control) with three different precisions at the root level.
    The other combinations of FMA will need to be handled with a wider
    instruction (80 bits).

    Now that I think about it, it may be better to have FMA with sign
    control at double precision at the root opcode level, and leave other precisions with wider formats.


    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
      FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    That is what I was trying to avoid. FMA uses a lot of opcode space for a
    single instruction that some programs never use, while programs that
    depend on float math use it heavily.

    I suppose the ISA could be re-configured depending on the program. I
    wonder how difficult it would be to use the RISC-V ISA. Scratches head.

    This is comparable to the space XG3 burns on Jumbo Prefixes.
      IMO, Jumbo Prefixes are encoding space better spent than 4R FMAC.



    Well, and FMAC is one and done.

    The scope of Jumbo Prefixes can expand gracefully.


    Like (in XG3), I can add immediate forms of the 128-bit ALU instructions:
      ADDX Rm, Imm33s, Rn
      ANDX Rm, Imm33s, Rn
    And, also Imm64 forms:
      ADDX Rm, Imm64, Rn
      ANDX Rm, Imm64, Rn

    ISA level changes needed? Basically none.

    I handle this with piecemeal postfixes so that a fixed-size instruction
    can be used. It is basically a single variable-width instruction, but
    with the root opcode (= a NOP) embedded at each location where an
    instruction might start.

    I have not figured out a good way to manage the PC increment for
    variable-sized instructions. I am running a simpler machine at the cost
    of some code density. I wonder, though, whether the implementation has
    too much of an impact on the ISA. It may be better to go with
    variable-length instructions and a lousy implementation of the PC
    increment to keep the ISA cleaner.

    I mostly just needed to tweak some decoding logic such that the high-
    half of the ALU gets fed a sign extension of the low half (rather than a mirrored copy). Otherwise, it is immediate synthesis on an existing instruction.


    No Imm128 forms ATM though.

    Something like:
      __int128 a, b, c;
      ...
      c = a + 0x123456789ABCDEF123456789ABCDEFUI128;
    Well, this is gonna require at least 3 instructions...


    But, could at least in theory be single instruction if the immediate
    fits within 64 bits.


    Currently no good/obvious way to map this sort of stuff over to RISC-V
    land though.

    It is possible that I could define an "AWX" or similar special case via
    the J21O scheme, could maybe, at least, get, say:
      ADDX Fd, Fs, Ft
      ADDX Fd, Fs, Imm17s


    And, with an ADDX and LI_Imm33 and SHORI32_Imm32, could in theory get
    the previous scenario down to 5 instructions:
      LI     F2, Imm63_32
      LI     F3, Imm127_96
      SHORI  F2, Imm31_0
      SHORI  F3, Imm95_64
      ADDX   F12, F10, F2

    ...


    Equivalent operation would take 28 bytes to encode in XG3, and 40 bytes
    in RV64+Jx. Or, maybe get it down to 32 bytes with the J52I prefixes.

    Well, and if jumbo prefixes are a hard sell for RV land, 128-bit ALU ops
    are likely to be worse.


    I cannot see there being a lot of use for 128-bit immediates, except
    when one wants to use an immediate for a SIMD operation, in which case
    all the bits are needed.

    ...


    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]





    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:20:28 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result status.

    The opcode looks like:
    SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP

    S=select exception status reg. 0 or 1
    F=function code, identifies FMA and others
    M=select constant for 1 or 2
    r=record result status in ccr1
    R=rounding mode
    V=vector mask register to use
    2=2nd source operand
    1=1st source operand
    D=destination operand
    O=Primary Opcode
    P=precision

    May I suggest:: a 6-bit Function code plus a 5-bit Primary OpCode
    plus a 2-bit precision field seems an excessive bit count for what
    you are getting out of them.

    Additionally, 3-bits for 5-states {RM} is a waste of entropy.

    What I did to address the waste of entropy was to define 4-bits
    to cover all the operand sign control and insertion of constants
    {5-bit tiny, 32-bit normal, 64-bit large}. By leaving out seldom-used
    patterns, the independent desires are crammed into fewer bits.

    It is not impossible to define FMA at the root level, using the function code bits for a register instead.

    Note the use of the postfix is only a code density issue. It is absorbed
    by the CPU and the FMA and postfix execute as one instruction so the
    dynamic instruction count is not any different.
    Internally, the micro-ops have room for three source operands.

    I guess it is a matter of how often FMA is used. There are also FADD,
    FSUB, FMUL instructions defined that are mapped to FMA internally which
    do not require a postfix.

    Q+ version 4 uses a 48-bit instruction that has room for three source registers, but most of the time the third register is not needed.

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.
    It operates like one 80-bit instruction with lots of unused bits.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    Potentially it allows a fused dot product. It may also help with some other instructions, perhaps reduction operations.
    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:36:13 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.

    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    From 1 instruction.
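    Folded into one parameterized model (my illustration, not the My 66000
    encoding), the sign-control and constant-insertion cases look like:

```python
# Sign control negates the addend and/or the product; constant
# insertion just passes a literal in an operand slot.
def fmac(addend, m1, m2, neg_addend=False, neg_product=False):
    a = -addend if neg_addend else addend
    p = m1 * m2
    return a + (-p if neg_product else p)

# The four sign-control forms:
assert fmac(1.0, 2.0, 3.0) == 7.0                     # Rd= R1+R2*R3
assert fmac(1.0, 2.0, 3.0, neg_addend=True) == 5.0    # Rd=-R1+R2*R3
assert fmac(1.0, 2.0, 3.0, neg_product=True) == -5.0  # Rd= R1-R2*R3
assert fmac(1.0, 2.0, 3.0, True, True) == -7.0        # Rd=-R1-R2*R3
# Constant insertion, e.g. a small immediate addend:
assert fmac(0.5, 2.0, 3.0) == 6.5                     # Rd= Im5+R2*R3
```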

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:43:59 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-20 3:50 a.m., BGB wrote:
    --------------------------
    IMHO:
    Can do the basic case as a single-width 3R instruction.
      Rd = Rd + Rs * Rt

    I am not fond of destructive ops. They sometimes use up an extra
    register and an extra instruction.

    Nor am I.

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...

    Turns out there is not enough room at the root level to add FMA at four different precisions with sign control (16 opcodes).

    See my immediately previous post on how I got it done in 32-bits {for
    the no constants case and the 5-bit constant case}.

    When a 5-bit immediate is used as a constant in an FP calculation,
    it represents the range {-15.5..+15.5} instead of {-32..31}.

    -----------------
    I cannot see there being a lot of use for 128-bit immediates, except for when one wants to use an immediate for a SIMD operation in which case
    all the bits are needed.

    I see not enough use of 128-bit to justify the entropy of accommodating it.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Mon Apr 20 18:55:51 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result status.

    Is the rounding mode really needed in every instruction? You would
    need a dynamic rounding mode anyway, and this could save you
    three bits. There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Apr 20 12:17:55 2026
    From Newsgroup: comp.arch

    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?


    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    Note that if you have an immediate value of zero for one of the
    addends, you have an FP multiply, so could use that to eliminate an op
    code (probably special case the hardware for speed). Similarly, if you
    had an immediate for R3 and had a value of one for it, you have defined
    an FP add, so could eliminate another op code.
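    Checked numerically with a model function (not an ISA definition):

```python
# An FMAC with a zero addend behaves as FMUL; with a unit multiplier
# it behaves as FADD.
def fmac(a, b, c):
    return a + b * c

assert fmac(0.0, 2.5, 4.0) == 2.5 * 4.0    # zero addend     -> FMUL
assert fmac(2.5, 4.0, 1.0) == 2.5 + 4.0    # unit multiplier -> FADD
```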



    From 1 instruction.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Mon Apr 20 20:35:10 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> schrieb:
    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Not sure what for - sign control on the product should be
    enough :-)



    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    R2 and R3 are interchangeable.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 21:24:39 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied to either
    R2 or R3. I chose sign control over R1 and R2 and not over R3, then
    carefully chose which reg is the addend (+) and which are the
    multiplicands (*).


    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    Again, * is commutative.

    Note that if you have an immediate value of zero for one of the
    addends, you have an FP multiply, so could use that to eliminate an op
    code (probably special case the hardware for speed). Similarly, if you
    had an immediate for R3 and had a value of one for it, you have defined
    an FP add, so could eliminate another op code.

    I have both {values 0 and 1} available, but I also have FADD and FMUL instructions. Given a Great-Big machine, one will likely have all 3
    function units, so FMUL can be routed to FMUL FU or FMAC FU, likewise
    for FADD to FADD or FMAC. In a Little-Bitty machine, FMAC can be the
    only FP FU.



    From 1 instruction.



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Apr 20 15:10:26 2026
    From Newsgroup: comp.arch

    On 4/20/2026 1:35 PM, Thomas Koenig wrote:
    Stephen Fuld <[email protected]d> schrieb:
    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS,
    CMOV, LOOP; with 3 left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Not sure what for - sign control on the product should be
    enough :-)

    DUH! I feel so stupid. :-(



    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    R2 and R3 are interchangeable.

    Yup. DUH! again.

    Thank you.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Apr 20 18:37:23 2026
    From Newsgroup: comp.arch

    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction? You would
    need a dynamic rounding mode anyway, and this could save you
    three bits. There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
    RNE, Default, so nothing special needed;
    DYN, Also instructions exists for this.
    DYN fetches the RM from FPSCR;
    Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
    FMAC Rm, Ro, Rn
    Where:
    Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of magnitude
    of performance difference based on whether it needs to be single-rounded
    (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier; been getting distracted
    thinking about Int128, it seems...



    ...


    [...]

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 21:51:18 2026
    From Newsgroup: comp.arch

    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the
    size of
    instructions. 40-bit instructions will save 17% on the code space
    while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
      RNE, Default, so nothing special needed;
      DYN, Also instructions exists for this.
        DYN fetches the RM from FPSCR;
      Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
      FMAC Rm, Ro, Rn
    Where:
      Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of magnitude
    of performance difference based on whether it needs to be single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate
    entropy? Something that can scan a binary file and rate the entropy?
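    A minimal scanner of the kind being asked about fits in a few lines of
    Python. This is only an order-0 sketch (an assumption on my part, not an
    existing tool): it rates the byte distribution alone, so it will
    understate structure that crosses byte boundaries, which matters for
    40-bit instruction widths.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte distribution, in bits/byte (max 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# A tightly packed encoding scores near 8 bits/byte; low scores suggest
# redundancy (padding, unused opcode patterns, heavily skewed field usage).
print(shannon_entropy(b"abab"))                # 1.0
print(shannon_entropy(bytes(range(256)) * 4))  # 8.0
```

    Running it over a compiled binary gives a single bits-per-byte figure
    that can be compared across ISA versions of the same program.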

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program
    class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I
    found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.

    With 80-bit instructions supplying two more registers, a fused
    dot-product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.
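    Incidentally, the numerical payoff of a fused FDP is that the single
    final rounding can preserve a result that two separately rounded
    operations lose entirely. A throwaway Python model (contrived values,
    using Fraction to stand in for the exact fused path; not Q+ code):

```python
from fractions import Fraction

a = b = 2.0**27 + 1               # (2^27+1)^2 needs 55 significand bits
c, d = -(2.0**54 + 2.0**28), 1.0  # exactly representable in binary64

# Two separately rounded ops: a*b rounds away the trailing +1,
# and the subsequent add cancels to exactly zero.
two_step = a * b + c * d

# Fused/single-rounded: form the exact value, round once at the end.
fused = float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))

print(two_step, fused)  # 0.0 1.0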











    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 22:50:14 2026
    From Newsgroup: comp.arch

    On 2026-04-20 9:51 p.m., Robert Finch wrote:
    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
       RNE, Default, so nothing special needed;
       DYN, Also instructions exists for this.
         DYN fetches the RM from FPSCR;
       Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
       FMAC Rm, Ro, Rn
    Where:
       Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of
    magnitude of performance difference based on whether it needs to be
    single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with
    thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    I found an entropy measurer on the web and built it.
    Entropy for Q+4 was only 0.21 out of 8 for the boot file.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.

    With 80-bit instructions supplying two more registers, a fused dot-
    product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.












    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Apr 21 06:33:45 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:

    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    Mitch has a very efficient packing of bits in his ISA, which has
    32-bit instructions. It would be possible in theory (not suggesting
    that you should do it :-) to take his encodings, make all register
    specifiers 7 bit to accommodate your 128 registers (which would
    give you 40 instead of 32 bits for four-register instructions)
    and then wonder what to do with the holes left by the instructions
    with fewer than four registers.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    I don't think this is a good idea. I assume you would want to
    reuse the opcode space (let's call the versions then A and B).
    What if you use ISA A, and the program dynamically loads a library
    using ISA B? Or is this something that each function would
    have to set dynamically?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Apr 21 01:36:09 2026
    From Newsgroup: comp.arch

    On 4/20/2026 8:51 PM, Robert Finch wrote:
    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
       RNE, Default, so nothing special needed;
       DYN, Also instructions exists for this.
         DYN fetches the RM from FPSCR;
       Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
       FMAC Rm, Ro, Rn
    Where:
       Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of
    magnitude of performance difference based on whether it needs to be
    single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with
    thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?


    Dunno there.

    I had mostly been doing everything manually, looking at stats and text
    files and looking for patterns.


    But, as noted, the majority of ops in my case end up being 32 bits.
    And, it seems I have recently started doing a little better on the code density front despite the lack of 16-bit ops in the case of XG3.


    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.


    I mostly ended up using Binary64 for all the scalar floating point in registers, but with more compact formats often being used in memory.

    This partly changed with SIMD and RV support.



    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.


    Only a few instructions include rounding modes, but only directly in jumbo-prefixed forms.

    Though:
    FADD/FADDG/FADDA could be considered as overlapping with the role of a
    rounding mode, but done more crudely, via just using multiple different
    instructions (and the naming scheme wasn't super consistent).

    Well, and say, both FMULA and FDIVA exist, but what they do is quite different:
    FMULA giving a FMUL result at Binary32 equivalent precision;
    FDIVA giving a crude FDIV approximation intended to start up an N-R process.



    With 80-bit instructions supplying two more registers, a fused dot-
    product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.


    Dunno there.

    In my case, I have a 6R3W regfile, but a 4R1W operation isn't really a
    thing in my case, more:
    3x 2R1W (64-bit)
    2x 3R1W (64-bit)
    1x 3R1W (128-bit, by ganging the first 2 lanes).


    Also each lane only natively provides for Imm33, so an Imm64 is split
    across 2 lanes.

    An Imm128 couldn't actually be done as the immediate-handling path isn't
    wide enough.




    OK...


    FWIW, I had been working on a new spec following the pattern of the
    IsaDesc doc specifically for XG3: https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2026-04-17_XG3_IsaDesc.txt

    Still needs a bit more work, thus far I mostly got it just past the
    level of "scaffold" and copy/pasted a bunch of text from the other spec.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Apr 21 07:07:49 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Tue Apr 21 06:43:28 2026
    From Newsgroup: comp.arch

    On 2026-04-21 2:33 a.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:

    The function code is not present in all instructions. Loads and stores
    immediate operates and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate
    entropy? Something that can scan a binary file and rate the entropy?

    Mitch has a very efficient packing of bits in his ISA, which has
    32-bit instructions. It would be possible in theory (not suggesting
    that you should do it :-) to take his encodings, make all register
    specifiers 7 bit to accommodate your 128 registers (which would
    give you 40 instead of 32 bits for four-register instructions)
    and then wonder what to do with the holes left by the instructions
    with fewer than four registers.

    Tempting. But it would end up just being a different implementation of
    that ISA. Some things like constants are done differently in Qupls5 ISA.
    The Q+ ISA supports vector masking, which might be done with a
    modified PRED modifier.

    For Q+5 compares and branches are also handled differently.

    I cannot seem to get variable length instructions to work at frequency,
    probably due mostly to routing.

    Something like Mitch's ISA or RISC-V could be supported by modifying
    the parser in the front-end.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program
    class. It could be controlled by a register in the CPU.

    I don't think this is a good idea. I assume you would want to
    reuse the opcode space (let's call the versions then A and B).
    What if you use ISA A, and the program dynamically loads a library
    using ISA B? Or is this something that each function would
    have to set dynamically?


    I think a program would need to be linked against a library with the
    same ISA. I think this is already done for building software for
    different machines.

    I was thinking of hopping ISAs on a dynamic basis, like frequency
    hopping. While SW is running.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Apr 21 17:59:45 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    ----------------------

    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have code out of the LLVM compiler that has 42 FMAC FU instructions
    in a row. {Many have constants, thus no LDs or STs; several have
    FDIV and/or elementary transcendentals {SIN(), COS(), LN(), EXP()}}

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    Take the size in bytes and compare against your favorite competitor.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    Each mode adds 1 to the exponent of verification complexity.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Apr 21 21:06:25 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    They do, including the single interesting case of both a & b being +/- zero:
    + * + -> +
    + * - -> -
    etc

    When just one operand is zero, the result is also zero, and the sign
    follows the usual product rules:

    positive a * -0.0 -> -0.0
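    These sign laws are easy to spot-check exhaustively for the signed-zero
    cases (a throwaway Python sketch; NaN and infinities deliberately
    excluded, since those bring their own rules):

```python
import math

def sgn(x: float) -> float:
    # copysign distinguishes -0.0 from +0.0, which plain == cannot
    return math.copysign(1.0, x)

vals = [0.0, -0.0, 3.0, -5.0]  # finite values only
for x in vals:
    for y in vals:
        # negation distributes over *, and double negation cancels,
        # including when either operand is a signed zero
        assert sgn(x * -y) == sgn(-(x * y)) == sgn(-x * y)
        assert sgn(-x * -y) == sgn(x * y)
print("sign laws hold, signed zeros included")
```

    This works because IEEE 754 defines the sign of a product as the XOR of
    the operand signs, independent of whether the magnitude is zero.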

    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    Should not be needed. We try hard to attain minimum surprise factor.

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    Ditto.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Apr 22 14:28:05 2026
    From Newsgroup: comp.arch

    On 4/21/2026 2:06 PM, Terje Mathisen wrote:
    Anton Ertl wrote:
    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

           FMAC Rd= R1+R2*R3
           FMAC Rd=-R1+R2*R3
           FMAC Rd= R1-R2*R3
           FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    They do, including the single interesting case of both a & b being +/-
    zero:
    + * + -> +
    + * - -> -
    etc

    When just one operand is zero, the result is also zero, and the sign
    follows the usual product rules:

    positive a * -0.0 -> -0.0

    The rules follow, but ironically I have run into code before that
    (ab)used floating point in such a way that actually following the IEEE
    rules caused things to break.

    Seemingly, always producing +0 was the way to make the code work.



    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    Should not be needed. We try hard to attain minimum surprise factor.

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    Ditto.


    Meanwhile, I am left to consider the possibility of a special case of a multiply where:
    + * + => +
    - * - => -
    Others: Unknown (either AND or OR)

    Mostly for sake of a possible FSSQR operator.
    But, this is niche enough that it is not obvious whether it would be justified.



    Did end up crossing the threshold of adding a special case constant-load instruction for repeating the same 16 bits 4 times.
    PLDCSW Imm16u, Rn
    Rn = Imm16u | (Imm16u<<16) | (Imm16u<<32) | (Imm16u<<48)
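    The replication PLDCSW performs is trivial to model (illustrative
    Python, not Q+ tooling):

```python
def pldcsw(imm16u: int) -> int:
    """Replicate a 16-bit immediate into all four 16-bit lanes of 64 bits."""
    assert 0 <= imm16u < (1 << 16)
    return imm16u | (imm16u << 16) | (imm16u << 32) | (imm16u << 48)

print(hex(pldcsw(0xABCD)))  # 0xabcdabcdabcdabcd
```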

    While not super-common, it was (excluding floating point values) the
    most common case for values that failed with one of the other constant-load-cases.


    Looking over the dumped fail-cases (that required a full 64-bit
    constant), thus far the majority (in the test program I was looking at)
    appear to be things like fractions with an NPOT divisor:
    1/3, 1/5, 1/7, 1/9,
    2/3, 2/7, ...

    In this case, the fractions dominate over the x.yyy pattern seen in
    wider searches.


    Many could fit a pattern though of collapsing down to 32 bits by omitting the middle part, say:
    (63:36), (3:0)
    Then, unpacking is, say:
    (31:4), (11:4), (11:4), (11:4), (11:4), ( 3:0)


    Though, debatable if worth it, as there are relatively few of them (and
    would merely reduce a 96-bit encoding to a 64-bit encoding).

    In many of these fractions, this middle section is merely a repeating
    byte, so this is how these would be compressible. It could also work for
    many of the x.yyy cases (many fill the low-order bits with a simple
    repeating pattern, apart from whatever rounding happens in the last nybble).

    Doesn't look like the pattern is predictable enough to cram it down into
    a 16-bit format though.

    ...



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Apr 24 10:55:16 2026
    From Newsgroup: comp.arch

    For Q+5, the vector mask register, status register selected and round
    mode selection are now specified using instruction modifiers. The
    modifiers apply for groups of up to eight instructions.

    There is already a way to specify the vector mask over a group of
    instructions in the Arpl language:
    vector_mask mvar;
    vector float res, a, b;
    res = mvar(a + b);

    A similar paradigm could be used to specify the status register and
    round mode selection:
    res = __frm(a+b,RNE);
    res = __fstat(a+b,1); // update status register 1

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Apr 24 22:53:16 2026
    From Newsgroup: comp.arch

    Q+5 now with vertical instruction encoding technology (another name for modifiers).

    Added a vector mask VMASK modifier to go along with the PRED modifier.

    Specifies the vector mask register (vm0 to vm7) for each following instruction.

    Messy logic that must go in the decode stage before rename.

    Strange to think that vertically encoded instructions are actually laid
    out horizontally in the instruction stream. Right up there with micro-ops.





    --- Synchronet 3.21f-Linux NewsLink 1.2