• Re: Pseudo-Immediates as Part of the Instruction

    From BGB@[email protected] to comp.arch on Sun Aug 10 18:59:29 2025
    From Newsgroup: comp.arch

    On 8/10/2025 1:07 PM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me as straight-up
    absurdity. So, personally, I wouldn't hold out much hope of it turning
    into anything very usable.

    While I think that not being able to be put to use isn't really one of the faults of the Concertina II ISA, the block structure, especially at its current level of complexity, is going to come across as quite weird to
    many, and I don't yet see any hope of achieving a drastic simplification
    in that area.


    OK.

    I judge things here by a few criteria:
    Could be affordably implemented in hardware;
    Would be usable and useful;
    Mostly makes sense in terms of relative cost/benefit tradeoffs.

    I am a little more pessimistic on things that I don't really feel
    satisfy the above constraints.

    For comparison, RISC-V mostly satisfies the above, although:
    Many of the extensions are weaker on these points;
    Some of the encodings, and the 'C' extension in general,
    are badly dog chewed.


    Then again, my ISA has potentially ended up with an excess of niche-case format converter instructions and similar.


    Each of the sixteen block types serves one or another functionality which
    I see as necessary to give this ISA the breadth of application that I have
    as my goal.


    Many ISAs make do with plain 32-bit or 16/32 encodings.

    Granted, I have ended up with more:
    16/32/64/96, depending on ISA.
    XG1, 16/32/64/96
    XG2, 32/64/96
    XG3, 32/64/96 (32/64 for RV ops)
    RV, 16/32/(48)/64


    Apparently, Huawei and similar have some 48-bit encodings defined for
    RV64. To my sensibilities, 48-bit only makes sense if one is already
    committed to 16-bit ops; and given how quickly they burnt through the
    encoding space, in practice the 48-bit space would just end up being a space-saving subset of the 64-bit space (in my experimental attempt to
    deal with the 48-bit encodings, they were unpacked temporarily into the
    64-bit encoding space).

    Basically, they burnt through most of the 48-bit encoding space with a
    handful of Imm32 and a few Disp32 ops. If it were me I would have gone
    for Imm24 ops and had a little more encoding space left over.

    Did experimentally mock up a 48-bit scheme that basically extended the 32-bit space to have Imm24 (adding 12 bits to each Imm/Disp for all the Imm12/Disp12 ops), but it was a little dog chewed. Could potentially
    also yield alternate encodings for Imm32 constant load and Disp32 branch
    (by adding 12 bits to LUI and JAL).

    One can argue, though, over which one would rather have:
    Pretty much all of the 32-bit immediate forms extended to 24 bits;
    Or, 32-bit immediate values,
    but only for a very limited range of ops.

    Though, I suspect for general use, extending the whole ISA to 24 bits
    might be "better" for average case code density (with 64-bit encodings
    for cases when one needs Imm32).
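    The 12-plus-12 widening described above can be sketched as follows. The
    field placement here is hypothetical (the post doesn't fix a layout);
    only the arithmetic of gluing 12 extension bits above an existing Imm12
    field and sign-extending the 24-bit result is illustrated.

    ```c
    #include <stdint.h>

    /* Widen an Imm12 op to Imm24: 12 extension bits from the extra 16-bit
       parcel go above the existing 12-bit immediate field (hypothetical
       layout), then the 24-bit value is sign-extended to 32 bits. */
    static int32_t widen_imm24(uint32_t imm12_field, uint32_t ext12_field) {
        uint32_t v = ((ext12_field & 0xFFFu) << 12) | (imm12_field & 0xFFFu);
        return (int32_t)(v << 8) >> 8;   /* sign-extend from bit 23 */
    }
    ```

    So an all-ones pair decodes to -1, and a set bit 23 gives the most
    negative representable value, -(1<<23).
    
    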

    Then again, I am on the fence about 48 bit encodings in general:
    Helps code density;
    Hurts performance for a cheap core;
    Say, if one doesn't want to spend the cost of dealing with superscalar
    for misaligned instructions and 16 bit ops (doing so would add
    significant resource cost).



    I did experiment with adding the C extension to BGBCC, and RV64GC+Jumbo
    can seemingly get decent code density.

    Granted, both are mostly similar here, both using 5-bit register fields.
    Though, XG1 16-bit ops mostly have access to 16 registers;
    and RV-C ops mostly use a mix of 8-register and 32-register fields.

    Did experiment with a pair encoding for XG3 (X3C), which doesn't match
    either XG1 or RV64GC+Jumbo in terms of code density. But not too far off.

    At the moment (Doom ".text" size, static-linked C library):
    XG1: 275K
    XG2: 290K
    RV64GC+Jumbo: 295K (vs 350K RV+Jumbo, or 370K RV64GC)
    XG3+X3C: 305K (vs 320K)

    Granted, XG3 isn't designed for maximum code density, rather performance
    and being able to merge with RV64G.

    It is unclear if the improvement in code density (of X3C) would be worth
    the added decoder cost (and doesn't fit in with the existing decoder
    paths for XG1 or RVC; so would need something new/wacky to deal with it).

    Though, could deal with it (in the core) in a similar way to how I dealt
    with 48-bit ops, namely unpacking it to a 64-bit form (two instructions)
    after fetch.


    In theory, XG3 should be able to match XG2 code density, as there isn't
    really anything that XG2 has that XG3 lacks that would significantly
    affect code density. XG3 did drop the 2RI-Imm10 ops, but these had
    largely become redundant. So, the main difference is likely related to
    BGBCC itself, which is mostly treating XG3 as an extension of its RV64G
    mode (which "suffers" slightly by having fewer usable callee-save
    registers in the ABI, and fewer register arguments; but I had on/off
    considered tweaking the ABI here).

    Though, if XG3 did match XG2 code density, X3C could potentially also
    reduce it to 275K.

    But, could just focus more on RV64GC here, as I sorta already needed it,
    and recently found/fixed a bug in the decoder in my CPU core that was
    stopping the 'C' extension from working (so now it seems to work).


    Though, to recap (X3C):
    X3C packs a 13-bit and a 14-bit instruction together into a 32-bit word;
    Which serves a similar purpose to RVC;
    Though only allows instruction pairs which can safely co-execute.
    Instructions encode:
    MOV/ADD Rm5, Rn5
    LI/ADD/ADDW Imm5s, Rn5
    SUB/ADDW/ADDWU/AND/OR/XOR Rm3, Rn3
    SLL/SRL/SRA Rm3, Rn3
    SLLW/SRLW/SLAW/SRAW Rm3, Rn3
    SLL/SRL/SRA Imm3, Rn3
    SLLW/SRLW/SLAW/SRAW Imm3, Rn3
    And, for the 14-bit case:
    LD/SD/LW/SW Rn5, Disp5(SP)
    LD/SD/LW/SW Rn3, Disp2(Rm3)
    LB/LBU/LH/LHU Rn3, 0(Rm3)
    SB/SH Rn3, 0(Rm3)
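    The 13-plus-14-into-32 packing can be sketched as below. The actual X3C
    field layout isn't given in the post, so the placement here (a 5-bit
    tag for the encoding space, since 13+14+5 = 32, with op A in the low
    bits) is purely an assumption; only the bit arithmetic is the point.

    ```c
    #include <stdint.h>

    /* Hypothetical layout: [31:27] space tag | [26:13] 14-bit op B |
       [12:0] 13-bit op A. Pack and unpack helpers. */
    static uint32_t x3c_pack(uint32_t tag5, uint32_t opA13, uint32_t opB14) {
        return ((tag5 & 0x1Fu) << 27) | ((opB14 & 0x3FFFu) << 13)
             | (opA13 & 0x1FFFu);
    }
    static uint32_t x3c_opA(uint32_t w) { return w & 0x1FFFu; }
    static uint32_t x3c_opB(uint32_t w) { return (w >> 13) & 0x3FFFu; }
    ```

    Round-tripping through pack/unpack recovers both sub-instructions,
    which is all a decoder splitting the pair would need to do.
    
    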

    X3C was put into a hole in the encoding space that previously held the
    PrWEX space (in XG1/XG2), but PrWEX is N/A in XG3. The WEX space is N/A
    (used for RV encodings, and the large-constant instruction was replaced
    with the XG3's Jumbo Prefix). Granted, the scope of X3C is more limited
    than that of RV-C.


    But I have introduced "scaled displacements" back in, allowing the
    augmented short instruction mode instruction set to be more powerful.


    OK.

    Yeah, scaled displacements make sense.


    Ironically, another one of my complaints about RVC is that while they
    saved bits in the displacements, rather than doing something sane like changing scale based on type, they bit-sliced the displacements based on
    type in a way that means it effectively has unique displacement
    encodings for:
    LW, Disp(SP)
    SW, Disp(SP)
    LD, Disp(SP)
    SD, Disp(SP)
    LW, Disp(Reg3)
    SW, Disp(Reg3)
    LD, Disp(Reg3)
    SD, Disp(Reg3)
    Which is, groan...

    Would have been better, say, if all the encodings just sorta had Rd/Rs2
    in the same spot and then not had separate Load/Store encoding.
    IMHO, having Rd and Rs2 in the same location is a lesser evil than
    having twice as many displacement types.

    And, also adjusting scale is a lesser evil than separate bit slicing for
    each type.
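    The scaled-displacement decode being argued for here is a one-liner: the
    encoded field counts elements of the access size rather than bytes, so
    the hardware shifts by log2(size) before adding. One field format then
    serves every access width, instead of a per-type bit slicing.

    ```c
    #include <stdint.h>

    /* Scaled-displacement effective address: disp_field counts elements,
       so a 5-bit field reaches 31 bytes for LB but 248 bytes for LD. */
    static uint64_t ea_scaled(uint64_t base, uint32_t disp_field,
                              unsigned log2size) {
        return base + ((uint64_t)disp_field << log2size);
    }
    ```

    E.g. the same field value 3 lands 24 bytes out for an 8-byte access
    (log2size = 3) but 12 bytes out for a 4-byte access (log2size = 2).
    
    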



    Though, it does lead to the partial irony that, despite XG3 having a
    longer listing than RV64G, when I wrote a VM that did both RV64 and XG3,
    the XG3 decoder was smaller, partly due to "less dog chew".

    The decoder is bigger in the Verilog core, but this is mostly because
    XG1/2/3 all use a shared decoder. An XG3 exclusive decoder would be smaller.

    Though, maybe moot if one is also going to need a RISC-V decoder, unless
    I make a purely XG3 target that doesn't use any of the RV encodings.




    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Aug 11 10:27:08 2025
    From Newsgroup: comp.arch

    On 8/10/2025 11:07 AM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me as straight-up
    absurdity. So, personally, I wouldn't hold out much hope of it turning
    into anything very usable.

    While I think that not being able to be put to use isn't really one of the faults of the Concertina II ISA,

    I am not sure what you are saying here. Is it the while you agree that
    at least some features cannot be put to use, but that isn't the fault of
    the ISA, or that the fault of not being able to be put to use doesn't
    exist in the ISA?


    the block structure, especially at its
    current level of complexity, is going to come across as quite weird to
    many, and I don't yet see any hope of achieving a drastic simplification
    in that area.

    Each of the sixteen block types serves one or another functionality which
    I see as necessary to give this ISA the breadth of application that I have
    as my goal.

    While I agree that they meet your goals (at least as I understand them),
    I think that you have two problems.

    Your goals, even if you meet them, aren't particularly useful, e.g. being "nearly" plug compatible with S/360.

    There are *far* simpler ways to accomplish what most people really want
    to do.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Mon Aug 11 18:20:05 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 10:27:08 -0700, Stephen Fuld wrote:
    On 8/10/2025 11:07 AM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me as straight-up
    absurdity. So, personally, I wouldn't hold out much hope of it turning
    into anything very usable.

    While I think that not being able to be put to use isn't really one of
    the faults of the Concertina II ISA,

    I am not sure what you are saying here. Is it the while you agree that
    at least some features cannot be put to use, but that isn't the fault of
    the ISA, or that the fault of not being able to be put to use doesn't
    exist in the ISA?

    What I was trying to say was that while the Concertina II ISA no doubt has many flaws, not being able to crank out useful work is, in my opinion, not
    one of them.

    On the other hand, driving insane those who attempt to program it or write compilers for it must be admitted to be an obstacle to making use of a
    given CPU, and so I must admit to its usability being limited in that
    manner.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Mon Aug 11 18:33:14 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 10:27:08 -0700, Stephen Fuld wrote:

    Your goals, even if you meet them, aren't particularly useful, e.g. being "nearly" plug compatible with S/360.

    There are *far* simpler ways to accomplish what most people really want
    to do.

    Being plug-compatible with System/360 is not among the goals of my ISA.
    The term "plug-compatible" refers to... _plugs_, as one might guess.
    Nothing in my ISA talks about stuff like USB ports, Centronics parallel ports... or the kind of port IBM used to connect a 1403 printer to a System/360 computer.

    There are certainly far simpler ways to run System/360 code correctly.
    One can just set a mode bit to enter System/360 emulation, for example.

    What I'm doing with the Type V header is to provide a way to imitate the behavior of a System/360 program after code conversion. So one could write
    a special FORTRAN compiler to generate code using this header to allow a FORTRAN program running on the Concertina II to deliver the same results
    as on a System/360.

    And this isn't simple because it's buried deep down in the instruction set
    as an _afterthought_ within an ISA which is primarily designed to do the
    same sort of work as one might do with an x86-64 chip or a PowerPC chip or
    a SPARC chip even. And secondarily designed to be capable of
    implementations which shine at whatever the TMS20C6000 shines at, or even whatever, if anything, the Itanium was good for.

    It may not, however, be lost on implementors that a full implementation of
    the Type V header stuff ends up putting the needed circuitry on the die to *provide* a very nice System/360 emulation or implementation, which they
    might offer as an added feature not defined in the Concertina II specification.

    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Mon Aug 11 19:16:06 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 18:33:14 +0000, John Savard wrote:

    implementations which shine at whatever the TMS20C6000 shines at, or

    Oops, the TMS320C6000.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Aug 24 18:16:12 2025
    From Newsgroup: comp.arch


    John Savard <[email protected]d> posted:

    On Sun, 03 Aug 2025 13:03:21 -0700, Stephen Fuld wrote:

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real programs* . If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them! And the best way to do that is to write actual code for a real program.

    While I'm not prepared to go to the trouble of creating a fleshed-out example, a very short and trivial example will still indicate what my
    goals are.

    X = Y * 2.78 + Z

    Just playing devil's advocate:: My 66000

    LDD R8,[Y]
    LDD R6,[Z]
    FMAC R7,R8,#2.78D0,R6
    STD R7,[X]

    X, Y, and Z can be anywhere in 64-bit VAS ...
    On the other hand if X, Y, and Z were allocated into registers::

    FMAC Rx,Ry,#2.78D0,Rz

    On a typical RISC architecture, this would involve instructions like this:

    load 18, Y
    load 19, K#0001
    fmul 18, 18, 19
    load 19, Z
    fadd 18, 18, 19
    fsto 18, X

    Six instructions, each 32 bits long.

    On the IBM System/360, though, it would be something like

    le 12, Y
    me 12, K#0001
    ae 12, Z
    ste 12, x

    All four instructions are memory-reference instructions, so they're also
    32 bits long.

    How would I do this on Concertina II?

    Well, since the sequence has to start with a memory-reference, I can't use the zero-overhead header (Type I). Instead, a Type XI header is in order; that specifies a decode field, so that space can be reserved for a pseudo-immediate, and instruction slots can be indicated as containing
    instructions from the alternate instruction set.

    Then the instructions can be

    lf 6,y
    mfr 6,#2.78
    af 6,z
    stf 6,x

    with the instruction "af" coming from the alternate 32-bit instruction set.

    The other tricky precondition that must be met is to store z in a data region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register, so that it is addressed with a 12-bit displacement. (Also, register 6, from
    the first eight registers, is used to do the arithmetic to meet the limitations of the "add floating" memory to register operate instruction
    in the alternate instruction set.)

    Because it uses a pseudo-immediate, which gets fetched along with the instruction stream, where the 360 uses a constant, it has an advantage
    over the 360. On the other hand, while the actual code is the same length, there's also the 32-bit overhead of the header.

    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Aug 24 19:50:44 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 8/5/2025 11:51 AM, Stephen Fuld wrote:
    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <[email protected]d> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK.  I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial, I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.  After all, both of those instructions can be accomplished by two "standard" instructions, a store and an add (for push) and a load and subtract (for pop).  Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    Of course, you are free to stop contributing on this topic, but I, for one, will miss your contributions.



    The lack of dedicated PUSH/POP instructions IME has relatively little
    direct impact on the usability of an ISA. Either way, one is likely to
    need stack-frame adjustment, in which case PUSH/POP don't tend to offer
    much over normal Load/Store instructions.

    When I looked at this at AMD circa 2000, I found many Pushes/Pops occurred
    in short sequences of 2-4; like:

    Push EAX
    Push EBP
    Push ECX

    a) we should note pushes are serially dependent on the decrement of SP
    b) and so are the memory references

    But we could change these into::

    ST EAX,[SP-8]
    ST EBP,[SP-16]
    ST ECX,[SP-24]
    SUB SP,SP,24

    a) now all the memory references are parallel
    b) there is only one alteration of SP
    c) all 4 instructions can start simultaneously
    So, latency goes from 3 to 1.
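    That the rewrite is safe (setting aside EricP's interrupt point below
    the SP) can be checked with a toy model: three serially-dependent pushes
    and the three independent negative-offset stores plus one SP adjustment
    leave identical memory and SP. The 16-slot stack and 64-bit slots are
    just assumptions of the sketch.

    ```c
    #include <stdint.h>

    enum { SLOTS = 16 };

    /* A push: each one depends on the SP produced by the previous push. */
    static void push64(uint64_t *mem, uint64_t *sp, uint64_t v) {
        *sp -= 8;
        mem[*sp / 8] = v;
    }

    /* Compare the push chain against the peephole-rewritten form. */
    static int peephole_preserves_state(void) {
        uint64_t mem_a[SLOTS] = {0}, mem_b[SLOTS] = {0};
        uint64_t sp_a = SLOTS * 8, sp_b = SLOTS * 8;
        uint64_t rax = 1, rbp = 2, rcx = 3;

        /* Original: serially dependent chain. */
        push64(mem_a, &sp_a, rax);
        push64(mem_a, &sp_a, rbp);
        push64(mem_a, &sp_a, rcx);

        /* Rewritten: the three stores are independent; one SP update. */
        mem_b[(sp_b - 8)  / 8] = rax;
        mem_b[(sp_b - 16) / 8] = rbp;
        mem_b[(sp_b - 24) / 8] = rcx;
        sp_b -= 24;

        if (sp_a != sp_b) return 0;
        for (int i = 0; i < SLOTS; i++)
            if (mem_a[i] != mem_b[i]) return 0;
        return 1;
    }
    ```
    
    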

    That said, a lot of John's other ideas come off to me as straight-up
    absurdity. So, personally, I wouldn't hold out much hope of it turning
    into anything very usable.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Sun Aug 24 16:21:06 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    BGB <[email protected]> posted:
    The lack of dedicated PUSH/POP instructions IME has relatively little
    direct impact on the usability of an ISA. Either way, one is likely to
    need stack-frame adjustment, in which case PUSH/POP don't tend to offer
    much over normal Load/Store instructions.

    When I looked at this at AMD circa 2000, I found many Pushes/Pops occurred
    in short sequences of 2-4; like:

    Push EAX
    Push EBP
    Push ECX

    a) we should note pushes are serially dependent on the decrement of SP
    b) and so are the memory references

    But we could change these into::

    ST EAX,[SP-8]
    ST EBP,[SP-16]
    ST ECX,[SP-24]
    SUB SP,SP,24

    a) now all the memory references are parallel
    b) there is only one alteration of SP
    c) all 4 instructions can start simultaneously
    So, latency goes from 3 to 1.

    Except storing below the SP is not interrupt safe without
    something special like defining a safe "red zone" below it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Aug 29 15:31:32 2025
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2025-08-01 5:04 p.m., John Savard wrote:
    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other blocks? With 5 bits, you could address others as well. Can you give an
    example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type of constant, including single byte constants.

    This is despite the fact that I do have an instruction format for conventional style byte immediates (and I've just squeezed in one for 16-bit immediates as well).

    However, they _can_ point to another block, by means of a sixth bit that some instructions have... but when this happens, it does not trigger an extra fetch from memory. Instead, the data is retrieved from a copy of an earlier block in the instruction stream that's saved in a special register... so as to reduce potential NOP-style problems.

    John Savard

    I tried something similar to this but without block headers and it
    worked okay. But there were a couple of issues. One was the last
    instruction in cache line could not have an immediate. Or instructions
    had to stop before the end of the cache line to accommodate immediates.
    This resulted in some wasted space. There would sometimes be a 32-bit
    hole between the last instruction and the first immediate. I used a
    four-bit index and a 32-bit immediate/instruction word size. Four bits
    was enough to index any word in a 512-bit cache line. IIRC the wasted
    space was about 5%.

    We really don't want to waste space.

    It made the assembler more complex. I had immediates being positioned
    from the far end of the cache line down (like a stack) towards the instructions which began at the lower end. The assembler had to be able
    to keep track of where things were on the cache line and the assembler
    was not built to handle that.
    Also, it made reading listings more difficult as constants were in the middle of sequences of instructions.

    We really don't want to make it any harder to read ASM code.

    Sometimes constants could be shared, but this turned out to be not
    possible in many cases as the assembler needed to emit relocation
    records for some constants and it could not handle having two or more instructions pointing to the same constant.

    All the more reason to place the constant in the instruction stream.
    a) never wastes space*
    b) ASM readability

    (*) never wastes space refers to placement of constant, not that the constant-container is optimal for the placed constant.
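    Robert's line-packing scheme above, and where the 32-bit hole comes
    from, can be modeled directly: 16 32-bit slots per 512-bit line,
    instructions filling from slot 0 up and immediates from slot 15 down
    (the two-ended layout is as he describes; the greedy fill policy is my
    assumption).

    ```c
    /* Pack n instructions into one 16-slot line. needs_imm[i] is nonzero
       if instruction i carries a 32-bit immediate (stored from the top of
       the line downward). Returns how many instructions fit; *holes gets
       the slots left unusable between the two regions. */
    static int pack_line(const int *needs_imm, int n, int *holes) {
        int next_insn = 0, next_const = 16;
        int i;
        for (i = 0; i < n; i++) {
            int need = 1 + (needs_imm[i] ? 1 : 0);
            if (next_const - next_insn < need)
                break;
            if (needs_imm[i])
                next_const--;
            next_insn++;
        }
        *holes = next_const - next_insn;
        return i;
    }
    ```

    With 15 plain instructions followed by one that needs an immediate,
    only 15 fit and one slot is wasted: exactly the "32-bit hole between
    the last instruction and the first immediate" described above.
    
    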
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Aug 29 19:35:15 2025
    From Newsgroup: comp.arch


    Lawrence D'Oliveiro <[email protected]d> posted:

    On Fri, 1 Aug 2025 15:11:49 -0000 (UTC), John Savard wrote:

    Well, that pointer - five bits long - is an awfully short pointer. Where does it point?

    Instructions are fetched in blocks that are 256 bits long. One of the things this allows for is for the block to begin with a header that specifies that a certain number of 32-bit instruction slots at the end
    of the current block are to be skipped over in the sequence of
    instructions to be executed; this space can be used for constants.

    Just add a couple of modifier bits: one is the indirect bit, indicating
    that the location referenced contains the address of the value, not the value itself, and another “page zero” bit, which indicates that the location is not in the current block, but in another block at a fixed address ...

    What is the purported advantage of using a header, instead of just having
    each instruction define its own length and contain its own constants?

    ... and I start having PDP-8 flashbacks.

    As well you should.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Sep 3 18:26:18 2025
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <[email protected]d> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial,

    AMD's (and my) K9 translated push and pop into ST and LD followed
    by sub/add, and then peephole combined the several adds so that a
    sequence of instructions::

    Push RAX
    Push RCX
    Push RDX

    became a parallel list of Operations::

    ST RAX,[SP-8]
    ST RCX,[SP-16]
    ST RDX,[SP-24]
    SUB SP,SP,#24

    Taking a data-dependent series of instructions (minimum of 3 cycles)
    and allowing all of them to begin execution in the same cycle. This is
    the fallacy of {push, pop, (Rx)++, --(Rx), and similar}. With GBOoO
    it is data-dependent latency that matters, not instruction count.

    I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.

    Push and Pop only scratch the surface.

    After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    Which we quit doing 30-odd years ago.

    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Sep 3 14:55:39 2025
    From Newsgroup: comp.arch

    On 9/3/2025 1:26 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <[email protected]d> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be
    beneficial,

    AMD's (and my) K9 translated push and pop into ST and LD followed
    by sub/add, and then peephole combined the several adds so that a
    sequence of instructions::

    Push RAX
    Push RCX
    Push RDX

    became a parallel list of Operations::

    ST RAX,[SP-8]
    ST RCX,[SP-16]
    ST RDX,[SP-24]
    SUB SP,SP,#24

    Taking a data-dependent series of instructions (minimum of 3 cycles)
    and allowing all of them to begin execution in the same cycle. This is
    the fallacy of {push, pop, (Rx)++, --(Rx), and similar}. With GBOoO
    it is data-dependent latency that matters, not instruction count.


    Though, this is more an argument that having PUSH and POP is not worthwhile. Maybe they help slightly with code density, but that is
    about it.


    They were something I ended up dropping earlier on, as it started to
    become obvious at the time that having them was net negative.


    It is possible that the assembler can fake them as pseudo-instructions,
    but even this doesn't seem worthwhile to do so.

    Well, never mind that I did eventually go and add the logic in the
    assembler to fake auto-increment addressing modes in RISC-V and similar.

    So, say:
    MOV.L @R10+, R13
    MOV.L R13, @-R11
    Or:
    MOV.L (R10)+, R13
    MOV.L R13, -(R11)

    Will at least work (by cracking each into multiple instructions), but, ...


    Still part of the ongoing tension of BGBCC targeting RV while using AT&T
    style ASM syntax (and, for ASM fragments, trying to infer the operand ordering per fragment based on which mnemonics are used, which isn't
    helped by my specs sort of mixing the use of mnemonics). If there isn't
    enough to infer a choice from, it defaults to the AT&T style
    operand ordering.

    Possible foot-guns all around here, but lack a better solution ATM.


    I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.

    Push and Pop only scratch the surface.

    After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other
    direction.

    Which we quit doing 30-odd years ago.


    I think the usual argument for a grows-upwards stack is that
    (presumably) it makes it less likely that a buffer overflow will hit the saved-registers area.

    But, pretty much everyone settled on grows-down stack.

    On a RISC-style ISA, the main difference it makes is that the OS and ABI
    need to agree on which direction the stack goes. Though, in theory,
    assuming it were an ABI choice, a flag in the binary or similar could be
    used to signal stack direction. Granted, DLLs/SOs would also need to
    agree which way the stack goes.



    Actually, could almost make a case for big-endian support, with binaries setting a flag for big-endian, and some sort of CPU control flag to set operation into big-endian mode.

    But, say, having a mismatch between OS and application endianness is
    asking for a mess.

    Less awful being that pretty much everything defaults to little-endian,
    but then having ISA support for endian-swapping, and some way to flag
    data as big-endian.

    FWIW, BGBCC has a __bigendian modifier, but this is pretty nonstandard
    (and not well tested).
    IIRC, ATM, it would need to be applied to every member in a struct for a
    fully BE struct, but it could maybe make sense to allow it to apply to a
    whole struct (in a similar way to "__packed" or "__attribute__((packed))").

    Mostly only applies to struct members and pointers, and only for integer types. Would also suck on RISC-V, which (without the Zbb extension's
    byte-reverse instruction) lacks any good way to do endian swapping.
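    For reference, the portable shift-and-mask byte swap below is roughly
    what a compiler has to emit for a 32-bit big-endian load on base RV64G
    (a single rev8 under Zbb, or REV/BSWAP on ISAs that have one, replaces
    the whole sequence).

    ```c
    #include <stdint.h>

    /* Generic 32-bit byte swap: isolate each byte and move it to the
       mirrored position. This is the fallback sequence where no
       byte-reverse instruction exists. */
    static uint32_t bswap32(uint32_t x) {
        return ((x & 0x000000FFu) << 24) |
               ((x & 0x0000FF00u) <<  8) |
               ((x & 0x00FF0000u) >>  8) |
               ((x & 0xFF000000u) >> 24);
    }
    ```
    
    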


    But, at least in the case of big-endian, it is commonly used in network protocols and some file formats, so not completely useless.


    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.


    --- Synchronet 3.21a-Linux NewsLink 1.2