• Q+ status / post-op instructions

    From Robert Finch@[email protected] to comp.arch on Sat Apr 18 21:24:52 2026
    From Newsgroup: comp.arch

    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions. The dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    The primary use of the post-op postfix is to supply an additional
    register for instructions like FMA or bitfield operations but it can
    also provide a second operation.

    The post-op is performed between the result of the first two source
    operands and a third operand supplied by the post-op postfix. The trick
    is that the post-op is treated as part of the first instruction by the
    CPU. Both the original op and the post-op are performed by the ALU at
    the same time. So, post-ops are almost fused instructions.
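    A minimal sketch of the idea in Python (my own toy model, not the actual
    Q+ decoder): the base op and the post-op fuse into one three-source
    micro-op at decode time, so both execute as a single instruction.

```python
# Toy model of post-op fusion: a two-source base instruction and an
# optional post-op postfix combine into one micro-op.

def fuse(base_op, ra, rb, post_op=None, rc=None):
    """Build one micro-op; with a postfix it carries three sources."""
    if post_op is None:
        return {"ops": (base_op,), "srcs": (ra, rb)}
    return {"ops": (base_op, post_op), "srcs": (ra, rb, rc)}

ALU = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

def execute(uop, regs):
    vals = [regs[r] for r in uop["srcs"]]
    result = ALU[uop["ops"][0]](vals[0], vals[1])
    if len(uop["ops"]) == 2:              # post-op: (a op b) post rc
        result = ALU[uop["ops"][1]](result, vals[2])
    return result

regs = {1: 2, 2: 3, 3: 10}
print(execute(fuse("mul", 1, 2, "add", 3), regs))   # 2*3 + 10 = 16
```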

    Dual-operation instructions were used about 0.2% of the time in a small
    sample of compiled code. IDK if they are worth it or not. I set my
    usefulness cutoff at 0.1%.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Apr 19 18:28:25 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions.

    How do you support FMAC ??

    40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions. The dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    The primary use of the post-op postfix is to supply an additional
    register for instructions like FMA or bitfield operations but it can
    also provide a second operation.

    Paying extra for THE workhorse FP calculation...

    The post-op is performed between the result of the first two source
    operands and a third operand supplied by the post-op postfix. The trick
    is that the post-op is treated as part of the first instruction by the
    CPU. Both the original op and the post-op are performed by the ALU at
    the same time. So, post-ops are almost fused instructions.

    Dual operand instructions were used about 0.2% of the time in a small
    sample of compiled code. IDK if they are worth it or not. I set my
    cutoff at 0.1% useful.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Apr 19 20:16:40 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)
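    The bit arithmetic above can be checked mechanically:

```python
# Bit budget for a 4-register operation in a 40-bit instruction word
# with 64 registers, as in the post above.
INSN_BITS = 40
REG_BITS = 6                  # 64 registers -> 6 bits each
used = 4 * REG_BITS + 2       # Rd, Ra, Rb, Rc plus two sign bits
opcode_bits = INSN_BITS - used
print(opcode_bits)            # 14 bits left for the opcode
```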

    But FMA is used a lot, this should be one instruction really.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Sun Apr 19 21:49:03 2026
    From Newsgroup: comp.arch

    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    The opcode looks like:
    SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP

    S=select exception status reg. 0 or 1
    F=function code, identifies FMA and others
    M=select constant for 1 or 2
    r=record result status in ccr1
    R=rounding mode
    V=vector mask register to use
    2=2nd source operand
    1=1st source operand
    D=destination operand
    O=Primary Opcode
    P=precision
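    A small field extractor for that layout (assuming the leftmost letter in
    the diagram is the most significant bit, which is my guess):

```python
# Sketch of a decoder for the 40-bit float-op layout quoted above.
FIELDS = [("S", 1), ("F", 6), ("M", 2), ("r", 1), ("R", 3), ("V", 2),
          ("src2", 6), ("src1", 6), ("D", 6), ("O", 5), ("P", 2)]
assert sum(w for _, w in FIELDS) == 40    # widths fill the 40-bit word

def decode(word40):
    out, shift = {}, 40
    for name, width in FIELDS:            # walk MSB -> LSB
        shift -= width
        out[name] = (word40 >> shift) & ((1 << width) - 1)
    return out
```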

    It is not impossible to define FMA at the root level, using the function
    code bits for a register instead.

    Note that the use of the postfix is only a code-density issue. It is
    absorbed by the CPU, and the FMA and postfix execute as one instruction,
    so the dynamic instruction count is no different.
    Internally, the micro-ops have room for three source operands.

    I guess it is a matter of how often FMA is used. There are also FADD,
    FSUB, FMUL instructions defined that are mapped to FMA internally which
    do not require a postfix.

    Q+ version 4 uses a 48-bit instruction that has room for three source registers, but most of the time the third register is not needed.

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.
    It operates like one 80-bit instruction with lots of unused bits.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    Potentially it allows a fused dot product. It may also help with some
    other instructions, perhaps reduction operations.
    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Apr 20 02:50:18 2026
    From Newsgroup: comp.arch

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.


    IMHO:
    Can do the basic case as a single-width 3R instruction.
    Rd = Rd + Rs * Rt

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...


    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    This is comparable to the space XG3 burns on Jumbo Prefixes.
    IMO, Jumbo Prefixes are encoding space better spent than 4R FMAC.



    Well, and FMAC is one and done.

    The scope of Jumbo Prefixes can expand gracefully.


    Like (in XG3), I can add immediate forms of the 128-bit ALU instructions:
    ADDX Rm, Imm33s, Rn
    ANDX Rm, Imm33s, Rn
    And, also Imm64 forms:
    ADDX Rm, Imm64, Rn
    ANDX Rm, Imm64, Rn

    ISA level changes needed? Basically none.

    I mostly just needed to tweak some decoding logic such that the
    high-half of the ALU gets fed a sign extension of the low half (rather
    than a mirrored copy). Otherwise, it is immediate synthesis on an
    existing instruction.
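    A rough arithmetic model of what that buys (my own sketch, not XG3's
    actual datapath): the low 64-bit ALU half adds the immediate, and the
    high half adds the immediate's sign extension plus the carry out of the
    low half.

```python
# 128-bit ADDX with a 33-bit signed immediate, computed as two 64-bit
# halves; the high half sees the sign extension of the immediate.
MASK64 = (1 << 64) - 1

def addx_imm33(a128, imm33s):
    assert -(1 << 32) <= imm33s < (1 << 32)   # fits in 33 signed bits
    imm128 = imm33s & ((1 << 128) - 1)        # sign-extend to 128 bits
    lo = (a128 & MASK64) + (imm128 & MASK64)
    hi = (a128 >> 64) + (imm128 >> 64) + (lo >> 64)  # carry into high half
    return ((hi & MASK64) << 64) | (lo & MASK64)

print(hex(addx_imm33(1 << 64, -1)))   # 0xffffffffffffffff
```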


    No Imm128 forms ATM though.

    Something like:
    __int128 a, b, c;
    ...
    c = a + 0x123456789ABCDEF123456789ABCDEFUI128;
    Well, this is gonna require at least 3 instructions...

    But, could at least in theory be single instruction if the immediate
    fits within 64 bits.


    Currently no good/obvious way to map this sort of stuff over to RISC-V
    land though.

    It is possible that I could define an "AWX" or similar special case via
    the J21O scheme, could maybe, at least, get, say:
    ADDX Fd, Fs, Ft
    ADDX Fd, Fs, Imm17s


    And, with an ADDX and LI_Imm33 and SHORI32_Imm32, could in theory get
    the previous scenario down to 5 instructions:
    LI F2, Imm63_32
    LI F3, Imm127_96
    SHORI F2, Imm31_0
    SHORI F3, Imm95_64
    ADDX F12, F10, F2
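    The reassembly can be sketched numerically; the LI/SHORI semantics here
    (load a 32-bit chunk; shift left 32 and OR in the next chunk) are my
    guesses from the post, for illustration only.

```python
# Two LI/SHORI pairs rebuild both 64-bit halves of a 128-bit constant.
M32, M64 = (1 << 32) - 1, (1 << 64) - 1

def li(imm32):
    return imm32 & M32

def shori(reg, imm32):
    return ((reg << 32) | (imm32 & M32)) & M64

imm128 = 0x123456789ABCDEF123456789ABCDEF      # constant from the post
c127_96, c95_64 = (imm128 >> 96) & M32, (imm128 >> 64) & M32
c63_32,  c31_0  = (imm128 >> 32) & M32, imm128 & M32

f3 = shori(li(c127_96), c95_64)    # high 64 bits
f2 = shori(li(c63_32),  c31_0)     # low 64 bits
assert (f3 << 64) | f2 == imm128   # the four chunks reassemble exactly
```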

    ...


    Equivalent operation would take 28 bytes to encode in XG3, and 40 bytes
    in RV64+Jx. Or, maybe get it down to 32 bytes with the J52I prefixes.

    Well, and if jumbo prefixes are a hard sell for RV land, 128-bit ALU ops
    are likely to be worse.


    ...


    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 06:21:07 2026
    From Newsgroup: comp.arch

    On 2026-04-20 3:50 a.m., BGB wrote:
    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leves 16 bit for opcodes for four-register operations.  Add two
    sign bits so you can have

       Rd = +/- Ra * Rb +/- Rc

    which still leaves you 14 bits for opcode space.  I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.


    IMHO:
    Can do the basic case as a single-width 3R instruction.
      Rd = Rd + Rs * Rt

    I am not fond of destructive ops. They sometimes use up an extra
    register and an extra instruction.

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...

    It turns out there is not enough room at the root level to add FMA at four different precisions with sign control (16 opcodes). So, I just added
    FMA (no sign control) with three different precisions at the root level.
    The other combinations of FMA will need to be handled with a wider
    instruction (80 bits).

    Now that I think about it, it may be better to have FMA with sign
    control at double precision at the root opcode level, and leave other precisions with wider formats.


    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
      FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    That is what I was trying to avoid. FMA uses a lot of opcode space for a
    single instruction that some programs never use, while programs that
    depend on float math use it heavily.

    I suppose the ISA could be re-configured depending on the program. I
    wonder how difficult it would be to use the RISC-V ISA. Scratches head.

    This is comparable to the space XG3 burns on Jumbo Prefixes.
      IMO, Jumbo Prefixes are encoding space better spent than 4R FMAC.



    Well, and FMAC is one and done.

    The scope of Jumbo Prefixes can expand gracefully.


    Like (in XG3), I can add immediate forms of the 128-bit ALU instructions:
      ADDX Rm, Imm33s, Rn
      ANDX Rm, Imm33s, Rn
    And, also Imm64 forms:
      ADDX Rm, Imm64, Rn
      ANDX Rm, Imm64, Rn

    ISA level changes needed? Basically none.

    I handle this with piecemeal postfixes so that a fixed-size instruction
    can be used. It is basically a single variable-width instruction, but
    with the root opcode (= a NOP) embedded at each location where an
    instruction might start.

    I have not figured out a good way to manage the PC increment for
    variable-sized instructions. I am running a simpler machine at the cost
    of some code density. I wonder, though, whether the implementation has
    too much of an impact on the ISA. It may be better to go with
    variable-length instructions and a lousy implementation of the PC
    increment to keep the ISA cleaner.

    I mostly just needed to tweak some decoding logic such that the high-
    half of the ALU gets fed a sign extension of the low half (rather than a mirrored copy). Otherwise, it is immediate synthesis on an existing instruction.


    No Imm128 forms ATM though.

    Something like:
      __int128 a, b, c;
      ...
      c = a + 0x123456789ABCDEF123456789ABCDEFUI128;
    Well, this is gonna require at least 3 instructions...


    But, could at least in theory be single instruction if the immediate
    fits within 64 bits.


    Currently no good/obvious way to map this sort of stuff over to RISC-V
    land though.

    It is possible that I could define an "AWX" or similar special case via
    the J21O scheme, could maybe, at least, get, say:
      ADDX Fd, Fs, Ft
      ADDX Fd, Fs, Imm17s


    And, with an ADDX and LI_Imm33 and SHORI32_Imm32, could in theory get
    the previous scenario down to 5 instructions:
      LI     F2, Imm63_32
      LI     F3, Imm127_96
      SHORI  F2, Imm31_0
      SHORI  F3, Imm95_64
      ADDX   F12, F10, F2

    ...


    Equivalent operation would take 28 bytes to encode in XG3, and 40 bytes
    in RV64+Jx. Or, maybe get it down to 32 bytes with the J52I prefixes.

    Well, and if jumbo prefixes are a hard sell for RV land, 128-bit ALU ops
    are likely to be worse.


    I cannot see there being a lot of use for 128-bit immediates, except
    when one wants to use an immediate for a SIMD operation, in which case
    all the bits are needed.

    ...


    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    [...]





    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:20:28 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result status.

    The opcode looks like:
    SFFFFFFMMrRRRVV222222111111DDDDDDOOOOOPP

    S=select exception status reg. 0 or 1
    F=function code, identifies FMA and others
    M=select constant for 1 or 2
    r=record result status in ccr1
    R=rounding mode
    V=vector mask register to use
    2=2nd source operand
    1=1st source operand
    D=destination operand
    O=Primary Opcode
    P=precision

    May I suggest:: a 6-bit Function code plus a 5-bit Primary OpCode
    plus a 2-bit precision field seems an excessive bit count for what
    you are getting out of them.

    Additionally, 3-bits for 5-states {RM} is a waste of entropy.

    What I did to address the waste of entropy was to define 4-bits
    to cover all the operand sign control and insertion of constants
    {5-bit tiny, 32-bit normal, 64-bit large}. By leaving out seldom-used
    patterns, the independent desires are crammed into fewer bits.

    It is not impossible to define FMA at the root level, using the function code bits for a register instead.

    Note the use of the postfix is only a code density issue. It is absorbed
    by the CPU and the FMA and postfix execute as one instruction so the
    dynamic instruction count is not any different.
    Internally, the micro-ops have room for three source operands.

    I guess it is a matter of how often FMA is used. There are also FADD,
    FSUB, FMUL instructions defined that are mapped to FMA internally which
    do not require a postfix.

    Q+ version 4 uses a 48-bit instruction that has room for three source registers, but most of the time the third register is not needed.

    which still leaves you 14 bits for opcode space. I would have a lot
    of trouble filling that opcode space :-)

    But FMA is used a lot, this should be one instruction really.
    It operates like one 80-bit instruction with lots of unused bits.

    The
    dual-operation instructions are replaced with post-ops. To use a third
    or a fourth operand a postfix instruction is needed.

    Four operands?

    Potentially it allows a fused dot product. It may also help with some other instructions, perhaps reduction operations.
    [...]




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:36:13 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.

    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    From 1 instruction.
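    Folded into one parameterized model (my illustration, not the My 66000
    encoding), the sign-control and constant-insertion cases look like:

```python
# Sign control negates the addend and/or the product; constant
# insertion just passes a literal in an operand slot.
def fmac(addend, m1, m2, neg_addend=False, neg_product=False):
    a = -addend if neg_addend else addend
    p = m1 * m2
    return a + (-p if neg_product else p)

# The four sign-control forms:
assert fmac(1.0, 2.0, 3.0) == 7.0                     # Rd= R1+R2*R3
assert fmac(1.0, 2.0, 3.0, neg_addend=True) == 5.0    # Rd=-R1+R2*R3
assert fmac(1.0, 2.0, 3.0, neg_product=True) == -5.0  # Rd= R1-R2*R3
assert fmac(1.0, 2.0, 3.0, True, True) == -7.0        # Rd=-R1-R2*R3
# Constant insertion, e.g. a small immediate addend:
assert fmac(0.5, 2.0, 3.0) == 6.5                     # Rd= Im5+R2*R3
```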

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 18:43:59 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-20 3:50 a.m., BGB wrote:
    --------------------------
    IMHO:
    Can do the basic case as a single-width 3R instruction.
      Rd = Rd + Rs * Rt

    I am not fond of destructive ops. They sometimes use up an extra
    register and an extra instruction.

    Nor am I.

    Then, for the other cases, and for 4R, switch over to a longer encoding.

    ...

    Turns out there is not enough room at the root level to add FMA at four different precisions with sign control (16 opcodes).

    See my immediately previous post on how I got it done in 32-bits {for
    the no constants case and the 5-bit constant case}.

    When a 5-bit immediate is used as a constant in an FP calculation,
    it represents the range {-15.5..+15.5} instead of {-32..31}.

    -----------------
    I cannot see there being a lot of use for 128-bit immediates, except for when one wants to use an immediate for a SIMD operation in which case
    all the bits are needed.

    I see not enough use of 128-bit to justify the entropy of accommodating it.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Mon Apr 20 18:55:51 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result status.

    Is the rounding mode really needed in every instruction? You would
    need a dynamic rounding mode anyway, and this could save you
    three bits. There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Apr 20 12:17:55 2026
    From Newsgroup: comp.arch

    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?


    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    Note that if you have an immediate value of zero for one of the
    addends, you have an FP multiply, so could use that to eliminate an op
    code (probably special case the hardware for speed). Similarly, if you
    had an immediate for R3 and had a value of one for it, you have defined
    an FP add, so could eliminate another op code.
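    Checked numerically with a model function (not an ISA definition):

```python
# An FMAC with a zero addend behaves as FMUL; with a unit multiplier
# it behaves as FADD.
def fmac(a, b, c):
    return a + b * c

assert fmac(0.0, 2.5, 4.0) == 2.5 * 4.0    # zero addend     -> FMUL
assert fmac(2.5, 4.0, 1.0) == 2.5 + 4.0    # unit multiplier -> FADD
```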



    From 1 instruction.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Mon Apr 20 20:35:10 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> schrieb:
    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are
    assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Not sure what for - sign control on the product should be
    enough :-)



    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    R2 and R3 are interchangeable.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Apr 20 21:24:39 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved; 4 are assigned {FMAC, INS, CMOV, LOOP}; 3 are left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied to either
    R2 or R3. I chose sign control over R1 and R2 and not over R3, then
    carefully chose which reg is the addend (+) and which are the
    multiplicands (*).


    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    Again, * is commutative.

    Note that if you have an immediate value of zero for one of the
    addends, you have an FP multiply, so could use that to eliminate an op
    code (probably special case the hardware for speed). Similarly, if you
    had an immediate for R3 and had a value of one for it, you have defined
    an FP add, so could eliminate another op code.

    I have both {values 0 and 1} available, but I also have FADD and FMUL instructions. Given a Great-Big machine, one will likely have all 3
    function units, so FMUL can be routed to FMUL FU or FMAC FU, likewise
    for FADD to FADD or FMAC. In a Little-Bitty machine, FMAC can be the
    only FP FU.



    From 1 instruction.



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Apr 20 15:10:26 2026
    From Newsgroup: comp.arch

    On 4/20/2026 1:35 PM, Thomas Koenig wrote:
    Stephen Fuld <[email protected]d> schrieb:
    On 4/20/2026 11:36 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 4/19/2026 3:16 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    --------------------
    Trying to do 4R in a single instruction word is ineffective use of
    encoding space.

    RISC-V falls into this trap in a few cases:
    FMADD/FMSUB/FNMADD/FNMSUB using 27 bits of encoding space.

    My 66000 has a subGroup of 8 instructions that cover all uses of
    3-operand 1-result (4-register) Major = {010xxx} So,
    Major 6-bits
    Result 5-bits
    Source 15-bits
    --------------
    26-bits
    leaving 6-bits which I then use
    2-bits for size {precision}
    4-bits Signs/Constant substitution

    Of the 8 potential instructions: 1 is permanently reserved, FMAC, INS,
    CMOV, LOOP; with 3 left unassigned.


    I am far from an applied mathematician, so these may be silly, but . . .



    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Not sure what for - sign control on the product should be
    enough :-)

    DUH! I feel so stupid. :-(



    Constant insertion gives me:

    FMAC Rd= Im5+R2*R3
    FMAC Rd= Rd+Im5*R3
    FMAC Rd= F32+R2*R3
    FMAC Rd= R1+F32*R3
    FMAC Rd= F64+R2*R3
    FMAC Rd= R1+F64*R3

    Similarly to my above question, is there no need for an immediate for R3?

    R2 and R3 are interchangeable.

    Yup. DUH! again.

    Thank you.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Apr 20 18:37:23 2026
    From Newsgroup: comp.arch

    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of instructions. 40-bit instructions will save 17% on the code space while being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers. Six bits per register
    leaves 16 bits for opcodes for four-register operations. Add two
    sign bits so you can have

    Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction? You would
    need a dynamic rounding mode anyway, and this could save you
    three bits. There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
    RNE, Default, so nothing special needed;
    DYN, Also instructions exists for this.
    DYN fetches the RM from FPSCR;
    Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
    FMAC Rm, Ro, Rn
    Where:
    Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of magnitude
    of performance difference based on whether it needs to be single-rounded
    (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier; been getting distracted
    thinking about Int128, it seems...



    ...


    [...]

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 21:51:18 2026
    From Newsgroup: comp.arch

    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the
    size of
    instructions. 40-bit instructions will save 17% on the code space
    while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
      RNE, Default, so nothing special needed;
      DYN, Also instructions exists for this.
        DYN fetches the RM from FPSCR;
      Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
      FMAC Rm, Ro, Rn
    Where:
      Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of magnitude
    of performance difference based on whether it needs to be single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate
    entropy? Something that can scan a binary file and rate the entropy?
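    A minimal scanner of the kind being asked about fits in a few lines of
    Python. This is only an order-0 sketch (an assumption on my part, not an
    existing tool): it rates the byte distribution alone, so it will
    understate structure that crosses byte boundaries, which matters for
    40-bit instruction widths.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte distribution, in bits/byte (max 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# A tightly packed encoding scores near 8 bits/byte; low scores suggest
# redundancy (padding, unused opcode patterns, heavily skewed field usage).
print(shannon_entropy(b"abab"))                # 1.0
print(shannon_entropy(bytes(range(256)) * 4))  # 8.0
```

    Running it over a compiled binary gives a single bits-per-byte figure
    that can be compared across ISA versions of the same program.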

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program
    class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I
    found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.

    With 80-bit instructions supplying two more registers, a fused
    dot-product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.
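    Incidentally, the numerical payoff of a fused FDP is that the single
    final rounding can preserve a result that two separately rounded
    operations lose entirely. A throwaway Python model (contrived values,
    using Fraction to stand in for the exact fused path; not Q+ code):

```python
from fractions import Fraction

a = b = 2.0**27 + 1               # (2^27+1)^2 needs 55 significand bits
c, d = -(2.0**54 + 2.0**28), 1.0  # exactly representable in binary64

# Two separately rounded ops: a*b rounds away the trailing +1,
# and the subsequent add cancels to exactly zero.
two_step = a * b + c * d

# Fused/single-rounded: form the exact value, round once at the end.
fused = float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))

print(two_step, fused)  # 0.0 1.0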











    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Apr 20 22:50:14 2026
    From Newsgroup: comp.arch

    On 2026-04-20 9:51 p.m., Robert Finch wrote:
    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
       RNE, Default, so nothing special needed;
       DYN, Also instructions exists for this.
         DYN fetches the RM from FPSCR;
       Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
       FMAC Rm, Ro, Rn
    Where:
       Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of
    magnitude of performance difference based on whether it needs to be
    single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with
    thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    I found an entropy measurer on the web and built it.
    Entropy for Q+4 was only 0.21 out of 8 for the boot file.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.

    With 80-bit instructions supplying two more registers, a fused dot-
    product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.












    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Apr 21 06:33:45 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> schrieb:

    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    Mitch has a very efficient packing of bits in his ISA, which has
    32-bit instructions. It would be possible in theory (not suggesting
    that you should do it :-) to take his encodings, make all register
    specifiers 7 bit to accommodate your 128 registers (which would
    give you 40 instead of 32 bits for four-register instructions)
    and then wonder what to do with the holes left by the instructions
    with fewer than four registers.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    I don't think this is a good idea. I assume you would want to
    reuse the opcode space (let's call the versions then A and B).
    What if you use ISA A, and the program dynamically loads a library
    using ISA B? Or is this something that each function would
    have to set dynamically?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Apr 21 01:36:09 2026
    From Newsgroup: comp.arch

    On 4/20/2026 8:51 PM, Robert Finch wrote:
    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    On 2026-04-19 4:16 p.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:
    Working on Q+ version 5 now. Version 5 is only going to support two
    source operands per instruction instead of three to decrease the size of
    instructions. 40-bit instructions will save 17% on the code space while
    being better than 95% as effective as the 48-bit instructions.

    You have 40 bit instructions and 64 registers.  Six bits per register
    leaves 16 bits for opcodes for four-register operations.  Add two
    sign bits so you can have

        Rd = +/- Ra * Rb +/- Rc
    There are about 34 bits. The FMA is part of a group of float-ops
    specified by a six bit function code. The primary opcode is also seven
    (5+2) bits. However, two bits of the primary opcode are used to specify
    the precision. Two bits are used to specify a vector mask. Three bits
    are used to specify a rounding mode. And one bit to record the result
    status.

    Is the rounding mode really needed in every instruction?  You would
    need a dynamic rounding mode anyway, and this could save you
    three bits.  There will likely not be many instructions with four
    registers, so you could use fewer bits for that particular opcode group.
    Having two bits of your primary opcode space always reserved for
    precision also seems a lot; there should be operations where this
    is not needed.


    IMHO: No.


    For fixed rounding modes, in my case they are either:
       RNE, Default, so nothing special needed;
       DYN, Also instructions exists for this.
         DYN fetches the RM from FPSCR;
       Others: Jumbo Prefix.

    Likewise for 4R (non-destructive FMAC).

    So, single 3R op as basic case:
       FMAC Rm, Ro, Rn
    Where:
       Rn=Rn+Rm*Ro

    But, then with the Jumbo Prefix one can get:
    * FMAC: Rn=Rp+Rm*Ro
    * FMAS: Rn=Rm*Ro-Rp
    * FMRS: Rn=Rp-Rm*Ro
    * FMRA: Rn=-(Rp+Rm*Ro)

    With some bits to also specify things like rounding mode and SIMD
    variants if desired.

    There is a big tradeoff though where there are a few orders of
    magnitude of performance difference based on whether it needs to be
    single-rounded (IOW: Don't try to do "Double-Double" with this thing).



    Seemingly failed to mention this earlier, getting more distracted with
    thinking about Int128 it seems...



    ...


    [...]


    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?


    Dunno there.

    I had mostly been doing everything manually, looking at stats and text
    files and looking for patterns.


    But, as noted, the majority of ops in my case end up being 32 bits.
    And, it seems I have recently started doing a little better on the code density front despite the lack of 16-bit ops in the case of XG3.


    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.


    I mostly ended up using Binary64 for all the scalar floating point in registers, but with more compact formats often being used in memory.

    This partly changed with SIMD and RV support.



    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    There are six combinations for rounding modes including DYN rounding. I found a potential use for another round code (statistical or random
    rounding fed from an entropy source), so I am hesitant to reduce these.

    Not all FP instructions include a rounding mode. It is only in the instructions where rounding makes sense. However, when rounding mode
    bits are not present the bits are used to extend the register selection
    to the full 128 registers.
    Instructions with a rounding mode are limited to 64 registers.


    Only a few instructions include rounding modes, but only directly in jumbo-prefixed forms.

    Though:
    FADD/FADDG/FADDA could be considered as overlapping with the role of a
    rounding mode, but done more crudely, via just using multiple different
    instructions (and the naming scheme wasn't super consistent).

    Well, and say, both FMULA and FDIVA exist, but what they do is quite different:
    FMULA giving a FMUL result at Binary32 equivalent precision;
    FDIVA giving a crude FDIV approximation intended to start up an N-R process.



    With 80-bit instructions supplying two more registers, a fused dot-
    product can be done.

    Rd = (Rs1*Rs2)+(Rs3*Rs4)

    It takes a lot of register ports though so I am not sure about trying to implement it. The FDP is only about 6500 LCs. Five read ports plus a
    port for the vector mask register plus a port for the rounding mode.


    Dunno there.

    In my case, I have a 6R3W regfile, but a 4R1W operation isn't really a
    thing in my case, more:
    3x 2R1W (64-bit)
    2x 3R1W (64-bit)
    1x 3R1W (128-bit, by ganging the first 2 lanes).


    Also each lane only natively provides for Imm33, so an Imm64 is split
    across 2 lanes.

    An Imm128 couldn't actually be done as the immediate-handling path isn't
    wide enough.




    OK...


    FWIW, I had been working on a new spec following the pattern of the
    IsaDesc doc specifically for XG3: https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/2026-04-17_XG3_IsaDesc.txt

    Still needs a bit more work, thus far I mostly got it just past the
    level of "scaffold" and copy/pasted a bunch of text from the other spec.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Apr 21 07:07:49 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Tue Apr 21 06:43:28 2026
    From Newsgroup: comp.arch

    On 2026-04-21 2:33 a.m., Thomas Koenig wrote:
    Robert Finch <[email protected]> schrieb:

    The function code is not present in all instructions. Loads and stores
    immediate operates and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate
    entropy? Something that can scan a binary file and rate the entropy?

    Mitch has a very efficient packing of bits in his ISA, which has
    32-bit instructions. It would be possible in theory (not suggesting
    that you should do it :-) to take his encodings, make all register
    specifiers 7 bit to accommodate your 128 registers (which would
    give you 40 instead of 32 bits for four-register instructions)
    and then wonder what to do with the holes left by the instructions
    with fewer than four registers.

    Tempting. But it would end up just being a different implementation of
    that ISA. Some things like constants are done differently in Qupls5 ISA.
    The Q+ ISA supports vector masking, which might be done with a
    modified PRED modifier.

    For Q+5 compares and branches are also handled differently.

    I cannot seem to get variable length instructions to work at frequency,
    probably due mostly to routing.

    Something like Mitch's ISA or RISC-V could be supported by modifying
    the parser in the front-end.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program
    class. It could be controlled by a register in the CPU.

    I don't think this is a good idea. I assume you would want to
    reuse the opcode space (let's call the versions then A and B).
    What if you use ISA A, and the program dynamically loads a library
    using ISA B? Or is this something that each function would
    have to set dynamically?


    I think a program would need to be linked against a library with the
    same ISA. I think this is already done for building software for
    different machines.

    I was thinking of hopping ISAs on a dynamic basis, like frequency
    hopping. While SW is running.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Apr 21 17:59:45 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-04-20 7:37 p.m., BGB wrote:
    On 4/20/2026 1:55 PM, Thomas Koenig wrote:
    ----------------------

    The function code is not present in all instructions. Loads and stores,
    immediate operates, and other miscellaneous instructions do not have the
    field. They go by the seven bit primary opcode.

    I have code out of the LLVM compiler that has 42 FMAC FU instructions
    in a row. {Many have constants, thus no LDs or STs; several have
    FDIV and/or elementary transcendentals {SIN(), COS(), LN(), EXP()}}

    I have found the breakdown of opcode and function code to be very
    packed, most of the codes and combinations of codes are used. I do not
    think there is much entropy wasted. Is there a tool that can estimate entropy? Something that can scan a binary file and rate the entropy?

    Take the size in bytes and compare against your favorite competitor.

    An issue may be the support for instructions that are used for only
    specific types of programs. There are lots of instructions used for half
    and single precision, but if a program is only using double precision
    then these are wasted.

    Momentarily thinking of a dynamically changing ISA based on the program class. It could be controlled by a register in the CPU.

    Each mode adds 1 to the exponent of verification complexity.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Apr 21 21:06:25 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

    FMAC Rd= R1+R2*R3
    FMAC Rd=-R1+R2*R3
    FMAC Rd= R1-R2*R3
    FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    They do, including the single interesting case of both a & b being +/- zero:
    + * + -> +
    + * - -> -
    etc

    When just one operand is zero, the result is also zero, and the sign
    follows the usual product rules:

    positive a * -0.0 -> -0.0
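    These sign laws are easy to spot-check exhaustively for the signed-zero
    cases (a throwaway Python sketch; NaN and infinities deliberately
    excluded, since those bring their own rules):

```python
import math

def sgn(x: float) -> float:
    # copysign distinguishes -0.0 from +0.0, which plain == cannot
    return math.copysign(1.0, x)

vals = [0.0, -0.0, 3.0, -5.0]  # finite values only
for x in vals:
    for y in vals:
        # negation distributes over *, and double negation cancels,
        # including when either operand is a signed zero
        assert sgn(x * -y) == sgn(-(x * y)) == sgn(-x * y)
        assert sgn(-x * -y) == sgn(x * y)
print("sign laws hold, signed zeros included")
```

    This works because IEEE 754 defines the sign of a product as the XOR of
    the operand signs, independent of whether the magnitude is zero.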

    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    Should not be needed. We try hard to attain minimum surprise factor.

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    Ditto.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Apr 22 14:28:05 2026
    From Newsgroup: comp.arch

    On 4/21/2026 2:06 PM, Terje Mathisen wrote:
    Anton Ertl wrote:
    MitchAlsup <[email protected]d> writes:

    Stephen Fuld <[email protected]d> posted:

    On 4/20/2026 11:36 AM, MitchAlsup wrote:
    Sign Control gives me:

           FMAC Rd= R1+R2*R3
           FMAC Rd=-R1+R2*R3
           FMAC Rd= R1-R2*R3
           FMAC Rd=-R1-R2*R3

    Is there no need for sign control of R3?

    Since * is commutative, sign control over * can be applied
    to either R2 or R3

    Actually, that's only true because

    1) Negation distributes over multiplication, i.e., a*(-b)=-(a*b)=(-a)*b

    2) Double negation cancels out, i.e. -(-a)=a, and (-a)*(-b)=a*b

    I am not sure if these algebraic laws hold for IEEE FP if a or b are 0
    or -0.

    They do, including the single interesting case of both a & b being +/-
    zero:
    + * + -> +
    + * - -> -
    etc

    When just one operand is zero, the result is also zero, and the sign
    follows the usual product rules:

    positive a * -0.0 -> -0.0

    The rules follow, but ironically I have run into code before that
    (ab)used floating point in such a way that actually following the IEEE
    rules caused things to break.

    Seemingly, always producing +0 was the way to make the code work.



    But in most code the difference is not important, so few programmers
    write (-a)*(-b), and therefore it's good enough to provide for (-a)*b
    and a*(-b) (encoded as (-b)*a thanks to commutativity indeed).

    Should not be needed. We try hard to attain minimum surprise factor.

    In the few cases where the difference (if it exists) is important,
    (-a)*(-b) can be encoded as negation followed by FMAC.

    Ditto.


    Meanwhile, I am left to consider the possibility of a special case of a multiply where:
    + * + => +
    - * - => -
    Others: Unknown (either AND or OR)

    Mostly for sake of a possible FSSQR operator.
    But, this is niche enough that it is not obvious whether it would be justified.



    Did end up crossing the threshold of adding a special case constant-load instruction for repeating the same 16 bits 4 times.
    PLDCSW Imm16u, Rn
    Rn = Imm16u | (Imm16u<<16) | (Imm16u<<32) | (Imm16u<<48)
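    The replication PLDCSW performs is trivial to model (illustrative
    Python, not Q+ tooling):

```python
def pldcsw(imm16u: int) -> int:
    """Replicate a 16-bit immediate into all four 16-bit lanes of 64 bits."""
    assert 0 <= imm16u < (1 << 16)
    return imm16u | (imm16u << 16) | (imm16u << 32) | (imm16u << 48)

print(hex(pldcsw(0xABCD)))  # 0xabcdabcdabcdabcd
```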

    While not super-common, it was (excluding floating point values) the
    most common case for values that failed with one of the other constant-load-cases.


    Looking over the dumped fail-cases (that required a full 64-bit
    constant), thus far the majority (in the test program I was looking at)
    appear to be things like fractions with an NPOT divisor:
    1/3, 1/5, 1/7, 1/9,
    2/3, 2/7, ...

    In this case, the fractions dominate over the x.yyy pattern seen in
    wider searches.


    Many could fit a pattern though of collapsing down to 32 bits by omitting the middle part, say:
    (63:36), (3:0)
    Then, unpacking is, say:
    (31:4), (11:4), (11:4), (11:4), (11:4), ( 3:0)


    Though, debatable if worth it, as there are relatively few of them (and
    would merely reduce a 96-bit encoding to a 64-bit encoding).

    In many of these fractions, this middle section is merely a repeating
    byte, so this is how these would be compressible. It could also work for
    many of the x.yyy cases (many fill the low-order bits with a simple
    repeating pattern, apart from whatever rounding happens in the last nybble).

    Doesn't look like the pattern is predictable enough to cram it down into
    a 16-bit format though.

    ...



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Apr 24 10:55:16 2026
    From Newsgroup: comp.arch

    For Q+5, the vector mask register, status register selected and round
    mode selection are now specified using instruction modifiers. The
    modifiers apply for groups of up to eight instructions.

    There is already a way to specify the vector mask over a group of
    instructions in the Arpl language:
    vector_mask mvar;
    vector float res, a, b;
    res = mvar(a + b);

    A similar paradigm could be used to specify the status register and
    round mode selection:
    res = __frm(a+b,RNE);
    res = __fstat(a+b,1); // update status register 1

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Apr 24 22:53:16 2026
    From Newsgroup: comp.arch

    Q+5 now with vertical instruction encoding technology (another name for modifiers).

    Added a vector mask VMASK modifier to go along with the PRED modifier.

    Specifies the vector mask register (vm0 to vm7) for each following instruction.

    Messy logic that must go in the decode stage before rename.

    Strange to think that vertically encoded instructions are actually laid
    out horizontally in the instruction stream. Right up there with micro-ops.





    --- Synchronet 3.21f-Linux NewsLink 1.2