• On Cray arithmetic

    From Thomas Koenig@[email protected] to comp.arch on Sat Oct 11 10:32:22 2025
    From Newsgroup: comp.arch

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 11 19:36:44 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    I hope BGB reads this and takes it to heart.

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 00:28:16 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan ...

    No harm in reminding everyone of his legendary foreword to the
    Standard Apple Numerics manual, 2nd ed, of 1988. He had something
    suitably acerbic to say about a great number of different vendors’
    idea of floating-point arithmetic (including Cray).

    I posted one instance here
    <http://groups.google.com/group/comp.lang.python/msg/5aaf5dd86cb00651?hl=en>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 01:15:23 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    Anybody curious about what’s on pages 62-5 of the Apple Numerics Manual
    2nd ed can find a copy here <https://vintageapple.org/inside_o/>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Sun Oct 12 04:04:46 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 06:06:35 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 04:04:46 -0000 (UTC), John Savard wrote:

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Mon Oct 13 07:23:21 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 06:06:35 +0000, Lawrence D’Oliveiro wrote:

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.

    That is a pity.

    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently quoted here.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon Oct 13 07:39:11 2025
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not only has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Mon Oct 13 09:05:18 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola) skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Mon Oct 13 13:12:12 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing
    in its entirety, that the “too hard” or “too obscure” parts were
    there for an important reason,

    It took many years for the *DEC* hardware designers to figure it out.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    to make programming that much easier, and should not be skipped.

    For many non-obvious parts of 754 it's true. For many other parts, esp.
    related to exceptions, it's false. That is, they should not be skipped,
    but the only reason for that is ease of documentation (just write "754"
    and you are done) and access to test vectors. These parts are not well
    thought out, do not make application programming any easier, and do not
    fit well into programming languages.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola), skimped a bit on hardware support.

    According to my understanding, Motorola suffered from being early
    adopters, similarly to Intel. They implemented 754 before the standard
    was finished and later on were in the difficult position of a conflict
    between compatibility with the standard vs. compatibility with previous
    generations.
    Moto is less forgivable than Intel, because as early adopters they were
    not nearly as early.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 13 12:30:33 2025
    From Newsgroup: comp.arch

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    Though, reading some stuff implies that a predecessor chip (the R4000)
    had a more functionally complete FPU. So, I guess it is also possible
    that the R4300 had a more limited FPU to make it cheaper for the
    embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
      Trying to hunt down some remaining bugs involving RVC in the CPU core;
        RVC is seemingly "the gift that keeps on giving" in this area.
        (The more dog-chewed the encoding, the harder it is to find bugs)
      Going from just:
        "Doing weak/crappy FP in hardware"
      To:
        "Trying to do less crappy FPU via software traps".
      A "mostly traps only" implementation of Binary128.
        Doesn't exactly match the 'Q' extension, but that is OK.
        I sorta suspect not many people are going to implement Q either.



    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts, so for
    longer running operations it is better to switch to a handler that can
    deal with interrupts (and, ATM, FDIV.Q and FSQRT.Q are kinda horridly
    slow; so, less like a TLB miss, and more like a page-fault...).

    The TestKern-related code is getting a little behind in my GitHub repo;
    the idea is that these parts will be posted when they are done.


    I had found/fixed one RVC bug since the last upload of the CPU core to
    GitHub, but more bugs remain and are still being hunted down.


    Progress is slow...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Oct 13 17:33:32 2025
    From Newsgroup: comp.arch


    Lawrence D’Oliveiro <[email protected]d> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly his favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Mon Oct 13 21:08:56 2025
    From Newsgroup: comp.arch

    On 13/10/2025 19:33, MitchAlsup wrote:

    Lawrence D’Oliveiro <[email protected]d> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    It does not make the programs more reliable - it makes them more
    consistent, predictable and portable. It does not make things easier
    for most code (support for NaNs and infinities can make some code
    easier, if mathematically nonsense results are a real possibility). But
    since consistency, predictability and portability are often very useful characteristics, full IEEE 754 compliance is a good thing for
    general-purpose processors.

    However, there are plenty of more niche situations where these are not
    vital, and where cost (die space, design costs, run-time power, etc.) is
    more important. Thus on small microcontrollers, it can be a better
    choice to skip support for the "obscure" stuff, and maybe even cut
    corners on things like rounding behaviour. The same applies to
    software floating point routines for devices that don't have hardware
    floating point at all.




    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola) skimped a bit on hardware support.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Oct 13 21:53:33 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    And this is why FP wants high quality implementation.

    Though, reading some stuff, implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.

    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall like handler. Though, in
    this case would likely overlap it with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 02:27:46 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It’s about the principle of
    least surprise.
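
    [A minimal C sketch of graceful underflow (an illustration added here,
    not part of the original posting; it assumes IEEE binary64 doubles):
    repeatedly shrinking the smallest positive normal double keeps yielding
    nonzero subnormals rather than dropping straight to zero.]

        #include <stdio.h>
        #include <float.h>

        int main(void) {
            double x = DBL_MIN;        /* 2^-1022, smallest positive normal */
            for (int i = 0; i < 4; i++) {
                x /= 16.0;             /* step down into the subnormal range */
                printf("%a\n", x);     /* nonzero subnormals, not an abrupt 0 */
            }
            return 0;
        }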

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 02:36:50 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the “too hard” or “too obscure”
    parts were there for an important reason,

    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You’ll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.

    According to my understanding, Motorola suffered from being early
    adapters, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in difficult position of
    conflict between compatibility wits standard vs compatibility with
    previous generations. Moto is less forgivable than Intel, because
    they were early adapters not nearly as early.

    Let’s see, the Motorola 68881 came out in 1984 <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before <https://en.wikipedia.org/wiki/IEEE_754>.

    I would say Motorola had plenty of time to read the spec and get it
    right. But they didn’t. So Apple had to patch things up in its
    software implementation, introducing a mode where for example those
    last few inaccurate bits in transcendentals were fixed up in software, sacrificing some speed over the raw hardware to ensure consistent
    results with the (even slower) pure-software implementation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 13 22:38:18 2025
    From Newsgroup: comp.arch

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding, with emulators then running instead on hardware with
    denormals and RNE.

    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like moving
    platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    Or, is the argument here that sticking with a weaker, not-quite-IEEE FPU
    is preferable to using trap handlers?



    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Also on RISC-V, it is more expensive to implement 128-bit arithmetic, so
    the actual cost might be lower.

    The main deviation from the Q extension is that it will use register
    pairs rather than 128-bit registers. I suspect that 128-bit registers
    would likely cause more problems for software built to assume RV64G
    than breaking the spec and using pairs does.

    Or, if the proper Q extension were supported, it would make more sense
    in the context of RV128, so that XLEN==FLEN. Otherwise, Q on RV64 would
    break the ability to move values between FPRs and GPRs (in the RV spec,
    they note the assumption that in this configuration, moves between FPRs
    and GPRs would be done via memory loads and stores). This would suck,
    and actively make the FPU worse than sticking primarily with the D
    extension and doing something nonstandard.


    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.


    Will have to test this more to find out.

    But, at least in the case of Binary128, the operations themselves are
    likely to be slow enough to partly offset the trap-handling and
    instruction decoding overheads.



    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall like handler. Though, in
    this case would likely overlap it with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.

    Yeah, no re-entrant interrupts here.

    For a longer-running operation, it is mostly needed to handle things
    with a context switch into supervisor mode. Can't use the normal
    SYSCALL handler though, as it itself may have been the source of the
    trap. So, Page-Fault needs its own handler task.


    It is likely that re-entrant interrupts would require a different and
    likely more complex mechanism.

    Well, and/or rework things at the compiler level so that the ISR proper
    is only used to implement a transition into supervisor mode (or from supervisor-mode back to usermode); and then fake something more like the
    x86 style interrupt handling.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Tue Oct 14 01:53:17 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the “too hard” or “too obscure”
    parts were there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You’ll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.
    According to my understanding, Motorola suffered from being early
    adapters, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in difficult position of
    conflict between compatibility wits standard vs compatibility with
    previous generations. Moto is less forgivable than Intel, because
    they were early adapters not nearly as early.

    Let’s see, the Motorola 68881 came out in 1984 <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before <https://en.wikipedia.org/wiki/IEEE_754>.

    Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended
    for the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    https://en.wikipedia.org/wiki/Weitek

    Unfortunately not all the chip documents are on bitsavers

    http://www.bitsavers.org/components/weitek/dataSheets/

    but the WTL-1164_1165 PDF from 1986 says

    FULL 32-BIT AND 64-BIT FLOATING POINT
    FORMAT AND OPERATIONS, CONFORMING TO
    THE IEEE STANDARD FOR FLOATING POINT ARITHMETIC

    2.38 MFlops (420 ns) 32-bit add/subtract/convert and compare
    1.85 MFlops (540 ns) 64-bit add/subtract/convert and compare
    2.38 MFlops (420 ns) 32-bit multiply
    1.67 MFlops (600 ns) 64-bit multiply
    0.52 MFlops (1.92 µs) 32-bit divide
    0.26 MFlops (3.78 µs) 64-bit divide
    Up to 3.33 MFlops (300 ns) for pipelined operations
    Up to 3.33 MFlops (300 ns) for chained operations
    32-bit data input or 32-bit data output operation every 60 ns


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 08:30:44 2025
    From Newsgroup: comp.arch

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.
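
    [An illustration of that "viral" property, added here and not part of
    the original posting; rescale() is a made-up helper, and default IEEE
    semantics (no -ffast-math) are assumed:]

        #include <stdio.h>
        #include <math.h>

        /* No NaN checks anywhere in the middle of the calculation. */
        static double rescale(double x, double lo, double hi) {
            return (x - lo) / (hi - lo);
        }

        int main(void) {
            printf("%f\n", rescale(2.5, 0.0, 10.0));  /* 0.250000 */
            printf("%f\n", rescale(NAN, 0.0, 10.0));  /* nan: the bad input
                                                         propagates through */
            return 0;
        }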

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 06:56:46 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 01:53:17 -0400, EricP wrote:

    Circa 1981 there was the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended for
    the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    Weitek add-on cards, I think mainly the early ones, were popular with
    more hard-core power users of Lotus 1-2-3. Remember, that was the
    “killer app” that prompted a lot of people to buy the IBM PC (and
    compatibles) in the first place. Some of them must have been doing some
    serious number-crunching, such that floating-point speed became a real
    issue.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 07:51:09 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.
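
    [A two-line C illustration of that pitfall, added here rather than
    taken from the posting; it assumes default IEEE comparison semantics,
    i.e. no -ffast-math:]

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double a = NAN, b = 1.0;
            printf("a < b    : %d\n", a < b);       /* 0: NaN compares false */
            printf("!(a >= b): %d\n", !(a >= b));   /* 1: not the same test */
            return 0;
        }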

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.
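
    [A concrete instance of that example, added as an illustration (it is
    not from the posting, and it is x86-64/SSE specific so that FTZ can be
    switched on at run time via <xmmintrin.h>): take a = DBL_MIN and b one
    ulp above it; a<b is true either way, but a-b is a tiny negative
    subnormal, so under flush-to-zero it becomes -0 and a-b<0 turns false.]

        #include <stdio.h>
        #include <float.h>
        #include <math.h>
        #include <xmmintrin.h>

        int main(void) {
            volatile double a = DBL_MIN;                 /* smallest normal */
            volatile double b = nextafter(DBL_MIN, 1.0); /* one ulp above a */

            printf("a<b=%d a-b<0=%d\n", a < b, (a - b) < 0.0);  /* 1 1 */

            _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); /* flush subnormal
                                                           results to zero */
            printf("a<b=%d a-b<0=%d\n", a < b, (a - b) < 0.0);  /* 1 0 */
            return 0;
        }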

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 10:47:56 2025
    From Newsgroup: comp.arch

    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer. But as long as you are aware of the
    possibility and consequences of NaNs, they can be useful.


    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?


    I'm sure there are a number of interesting ways to model this kind of
    thing, in a programming language that supported it. NaN's in floating
    point are somewhat akin to error values in C++ std::expected<>, or empty std::optional<> types, or like "result" types found in many languages.

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more useful result. And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold? I do not know the details here - it is simply
    not something that turns up in the kind of coding I do. (In my line of
    work, floating point values and expression results are always "normal",
    if that is the correct term. I can always use gcc's "-ffast-math", and
    I think a lot of real-world floating point code could do so - but I
    fully appreciate that does not apply to all code.)

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?


    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway before everything falls apart, but only a tiny amount. Doing it right
    is going to cost you, in development time or runtime efficiency, but
    that's better than getting the wrong answers quickly!



    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 11:26:10 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it
    produces a NaN instead. I doubt that many people would find that
    useful, however.
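
    [One concrete way to do that, added as a sketch (glibc/x86 specific,
    not from the posting): feenableexcept() unmasks the underflow
    exception so it traps instead of merely raising a flag.]

        #define _GNU_SOURCE
        #include <fenv.h>
        #include <float.h>
        #include <stdio.h>

        int main(void) {
            feenableexcept(FE_UNDERFLOW);  /* trap rather than set a flag */
            volatile double x = DBL_MIN;
            x /= 3.0;                      /* tiny, inexact result: SIGFPE */
            printf("not reached: %a\n", x);
            return 0;
        }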


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals make it possible to represent positive numbers down to
    2^-1074. So, with denormal numbers, the absolute value of your dividend
    must be less than 2^-50 to produce a non-infinite result where
    flush-to-zero would have produced an infinity.
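
    [Worked numbers for that bound, added as an illustration (assuming IEEE
    binary64, not part of the posting): the largest finite double is just
    under 2^1024, so dividing by the smallest positive subnormal, 2^-1074,
    stays finite only when the dividend is below roughly 2^-50.]

        #include <stdio.h>

        int main(void) {
            double tiny = 0x1p-1074;         /* smallest positive subnormal */
            printf("%g\n", 0x1p-51 / tiny);  /* about 2^1023: still finite */
            printf("%g\n", 1.0     / tiny);  /* 2^1074: overflows to inf */
            return 0;
        }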

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
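
    [Case 5) in two lines of C, added as an illustration (assumes IEEE
    semantics; not from the posting):]

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double inf = INFINITY;
            printf("inf<=inf  : %d\n", inf <= inf);          /* 1 */
            printf("inf-inf<=0: %d\n", (inf - inf) <= 0.0);  /* 0: inf-inf
                                                                is NaN */
            return 0;
        }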

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is
    debugged).

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or
    chemist.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    are the more important benefit. And if you take a different branch of
    an IF-statement because you have a flush-to-zero FPU, you can easily
    get a completely bogus result when the denormal case would still have
    had enough accuracy by far.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Oct 14 15:37:10 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It’s about the principle of
    least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating
    arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    Subnormal is critical for stability of zero-seeking algorithms, i.e. a
    lot of standard algorithmic building blocks.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Oct 14 15:42:45 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    You have just named the only common pitfall, where all comparisons
    against NaN shall return false.

    You can in fact define your own

    bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;
    }

    but this depends on the compiler/optimizer not messing up.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 17:29:40 2025
    From Newsgroup: comp.arch

    On 14/10/2025 13:26, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.


    NaNs don't materialise spontaneously either. They can be the result of intentionally using NaNs for missing data, or when your code is buggy
    and failing to calculate something reasonable. In either case, the
    surprise happens when someone passes the non-value to code that was not expecting to have to deal with it.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    Fair enough.


    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    I would disagree that this is the most common use for null pointers.
    But it certainly is /one/ use, and programmers should handle that usage correctly.

    So to sum up, there is a certain similarity, but there are also
    significant differences.


    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
    useful, however.


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.


    OK.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    The associative law holds fine with UB on overflow, as do things like
    adding a positive number to an integer makes it bigger. But this is all straying from the discussion on floating point, and I suspect that we'd
    just re-hash old disagreements rather than starting new and interesting
    ones :-)


    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
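    (Editorial sketch, not from Anton's post: case 5 in runnable C -- with both
    operands infinite, a<=b holds but a-b is NaN, so a-b<=0 does not.)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = INFINITY, b = INFINITY;

        printf("a <= b     : %d\n", a <= b);          /* 1 */
        printf("a - b <= 0 : %d\n", (a - b) <= 0.0);  /* 0, since a-b is NaN */
        return 0;
    }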


    Any kind of arithmetic with infinities is going to be awkward in some way!

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is debugged).


    Sure.

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or chemist.


    Fair enough.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    is the more important benefit. And if you take a different branch of
    an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
    enough accuracy by far.


    Well, I think that if your values are getting small enough to make denormal results, your code is at least questionable. I am not
    convinced that the equivalency you mentioned above is enough to make
    denormals worth the effort, but that may be just the kind of code I
    write. (And while I did study some of this stuff - numerical stability
    - in my mathematics degree, it was quite a long time ago.)

    Thanks for the comprehensive and educational information here. It is appreciated.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:31:00 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for
    the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.

    But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.

    Or, is the argument here that sticking with weaker not-quite IEEE FPU is preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.

    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.

    <snip>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:34:23 2025
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.

    But I find it harder to understand why denormals or subnormals are going
    to be useful.

    1/Big_Num does not underflow .............. completely.
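    (Editorial sketch, not part of Mitch's post, illustrating the point: the
    reciprocal of the largest double lands in the subnormal range instead of
    underflowing all the way to zero, so the information is not lost.)

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double big = DBL_MAX;       /* ~1.8e308 */
        double r   = 1.0 / big;     /* ~5.6e-309: below DBL_MIN, so subnormal */

        printf("1/DBL_MAX   = %g\n", r);
        printf("r > 0       : %d\n", r > 0.0);       /* 1 */
        printf("r < DBL_MIN : %d\n", r < DBL_MIN);   /* 1 */
        printf("big * r     = %g\n", big * r);       /* ~1 */
        return 0;
    }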

    Ultimately, your floating point code is approximating arithmetic on real numbers.

    Don't make me laugh.

    Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:47:20 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    Many ISAs and many programs have trouble in getting NaNs into the
    ELSE-clause. One cannot use deMorgan's Law to invert conditions in
    the presence of NaNs.

    We (Brain, Thomas and I) went to great pain to have FCMP deliver a
    bit pattern where one could invert the condition AND still deliver
    the NaN to the expected Clause. We threw in Ordered and Totally-
    Ordered at the same time, along with OpenCL FP CLASS() function.

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    The worst of all possible results is no information whatsoever.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant.

    Windows 7 and Office 2003 were good enough. That would have allowed
    zillions of programmers to go address the software crisis after being
    freed from projects that had become good enough not to need continual
    work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Oct 14 16:48:50 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> wrote:

    David Brown <[email protected]> posted:

    Ultimately, your floating point code is approximating
    arithmetic on real numbers.

    Don't make me laugh.

    Somebody (not me) recently added the following to the gcc bugzilla
    quip file:

    The "real" type in fortran is called "real" because the
    mathematician should not notice that it has finite decimal places
    and forget that one needs lengthy adaptations of the proofs for
    that....
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 16:46:03 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    The associative law holds fine with UB on overflow,

    With 32-bit ints:

    The result of (2000000000+2000000000)+(-2000000000) is undefined.

    The result of 2000000000+(2000000000+(-2000000000)) is 2000000000.

    So, the associative law does not hold.

    With -fwrapv both are defined to produce 2000000000, and the
    associative law holds because modulo arithmetic is associative.
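    (Editorial sketch, not from Anton's post: the wrapping semantics that
    -fwrapv guarantees, modelled here with unsigned arithmetic so the example
    is portable; wadd() is a made-up helper.  Both groupings of the sum above
    give the same wrapped result.)

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit wrapping addition, i.e. what -fwrapv promises for int.  The
       conversion back to int32_t is implementation-defined before C23, but
       is two's-complement wrapping on all mainstream compilers. */
    static int32_t wadd(int32_t a, int32_t b)
    {
        return (int32_t)((uint32_t)a + (uint32_t)b);
    }

    int main(void)
    {
        int32_t a = 2000000000, b = 2000000000, c = -2000000000;

        printf("%d\n", wadd(wadd(a, b), c));   /* 2000000000 */
        printf("%d\n", wadd(a, wadd(b, c)));   /* 2000000000 */
        return 0;
    }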

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 17:26:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted:
    [...]
    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    That may be a good idea. You can write it in current languages as
    follows:

    if (a<b) {
        ...
    } else if (a>=b) {
        ...
    } else {
        ... NaN case ...
    }

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    I don't think anything about FCMP. What I wrote above is about
    programming languages. I.e., a<b would trap if a or b is a NaN, while lt_or_nan(a,b) would be true if a or b is a NaN, and
    lt_and_not_nan(a,b) would be false if a or b is a NaN. I think the
    IEEE754 people have better names for these comparisons, but am too
    lazy to look them up.
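    (Editorial sketch: one way to spell the hypothetical lt_or_nan and
    lt_and_not_nan in C99, using the standard isunordered() macro; the names
    are Anton's ad-hoc ones, not IEEE 754's.  A trapping a<b would in addition
    need the invalid-operation exception enabled, which is not shown.)

    #include <math.h>
    #include <stdbool.h>

    /* "Less than, or unordered": also true when either operand is a NaN. */
    static bool lt_or_nan(double a, double b)
    {
        return isunordered(a, b) || a < b;
    }

    /* "Less than, and ordered": false whenever either operand is a NaN.
       Under IEEE semantics this collapses to plain a < b; the explicit
       form just documents the intent. */
    static bool lt_and_not_nan(double a, double b)
    {
        return !isunordered(a, b) && a < b;
    }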

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    We can all wish for Kahan writing all FP code, but that only deepens
    the software crisis. Educating programmers is certainly a worthy
    undertaking, but providing a good foundation for them to build on
    helps those programmers as well as those that are less educated.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Oct 14 12:45:08 2025
    From Newsgroup: comp.arch

    On 10/14/2025 10:31 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for
    the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding. Then, with emulators running instead on hardware with
    denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.


    This mostly applies to FMUL, but:
    I had already added a trap case for this as well.

    In the cases where all the low-order bits of either input are 0,
    then the low-order results would also be 0 and so are N/A (the final
    result would be the same either way).

    If both sets of low-order bits are non-zero, it can trap.
    This does mean that the software emulation will need to provide a full
    width result though.

    Checking for non-zero here being more cost-effective than actually doing
    a full width multiply.


    Also, RISC-V FMADD.D and similar are sorta also going to end up as traps
    due to the lack of single-rounded FMA (though had debated whether to
    have a separate control-flag for this to still allow non-slow FMADD.D
    and similar; but as-is, these will trap).



    For FADD:
    The shifted-right bits that fall off the bottom (of the slightly-wider internal mantissa) don't matter, since they were always being added to
    0, which can't generate any carry.

    For FSUB, it may matter, but more in the sense that one can check
    whether the "fell off the bottom" part had non-zero bits and use this to adjust the carry-in part of the subtractor (since non-zero bits would
    absorb the carry-propagation of adding 1 to the bottom of a
    theoretically arbitrarily wide twos complement negation).

    So, in theory, can be dealt with in hardware to still give an exact result.


    There are still some sub-ULP bits, so the complaint about the lack of a
    guard bit doesn't really apply.


    Also apparently the Cray used a non-normalized floating point format (no hidden bit), which was odd (and could create its own issues).

    Though, potentially a non-normalized format with lax normalization could
    allow for cheaper re-normalization (even if it could require
    re-normalization logic for FMUL). Though, for such a format, there is
    the possibility that someone could make re-normalization be its own instruction (allowing for an FPU with less latency).


    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.
    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.


    As noted above, I was already working on this.


    Or, is the argument here that sticking with weaker not-quite IEEE FPU is
    preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.



    OK.

    As can be noted, for scalar operations I consider there to be a limit as
    to how bad is "acceptable".

    For SIMD operations, it is a little looser.
    For example, the ability to operate on integer values and get exact
    results is basically required for scalar operations, but optional for SIMD.

    Though, in this case it is a case of both Quake and also some JavaScript
    VMs relying on the ability to express integer values as floating-point
    numbers and use them in calculations as such (so, for example, if the operations don't give exact results then the programs break).


    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.


    OK.


    Can note that in my looking, it seems like:
    Pretty much none of the ASIC implementations support the Q extension;
    It is not required in any of the mainline profiles;
    Implementing Q proper would have non-zero impact on RV64G:
    The differences between F+D and F+D+Q being non-zero.
    Whereas, "fudging it" can retain strict compatibility with D.
    Where, people actually use 'D'.

    There is a non-zero amount of code using "long double", but in this case
    the bigger issue is more the code footprint of the associated
    long-double math functions rather than performance (say, if someone uses "cosl()" or similar).

    Still not ideal, as (with my existing ISA extensions) there is still no single-instruction way to load a 64-bit value into an FPR.

    But, could at least reduce it from 11 (44 bytes) instructions to 3 (20
    bytes; "LI-Imm33; SHORI-Imm32; FMV.D.X"). This still means 40 bytes to
    load a full-width Binary128 literal.
    Loading the same literal would need 24 bytes in XG3.
    And, an unrolled Taylor expansion uses a lot of them.

    With Q proper? Only option would be to use memory loads here.
    Like, the C math functions are annoyingly bulky in this case.


    Meanwhile, elsewhere I saw a mention that apparently, to deal with RISC-V
    fragmentation issues, there is now work underway on a mechanism to allow
    modification of the RISC-V instruction listings in GCC without needing
    to modify the code in GCC proper each time (basically hot injecting
    stuff into the instruction listing and similar).

    As apparently having everyone trying to modify the ISA every which way
    is making a bit of an awful mess of things.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 03:45:31 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The “crisis” was supposed to have to do with the shortage of programmers to
    write all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the “crisis” exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 03:47:14 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 15:47:20 GMT, MitchAlsup wrote:

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    All the good languages have IEEE754 compliant arithmetic libraries,
    including type queries for things like isnan().

    E.g. <https://docs.python.org/3/library/math.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Oct 15 05:55:40 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don’t think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 12:41:28 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT
    [email protected] (Anton Ertl) wrote:
    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don’t think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.
    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 12:36:17 2025
    From Newsgroup: comp.arch

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 616 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of
    magnitude.)
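    (Editorial sketch: the same limits straight from <float.h>; DBL_TRUE_MIN
    is the C11 name for the smallest subnormal, hence the guard.)

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("DBL_MAX      = %g\n", DBL_MAX);       /* ~1.8e308 */
        printf("DBL_MIN      = %g\n", DBL_MIN);       /* ~2.2e-308, smallest normal */
    #ifdef DBL_TRUE_MIN
        printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN);  /* ~4.9e-324, smallest subnormal */
    #endif
        printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);  /* 53 mantissa bits */
        return 0;
    }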

    Denormals let you squeeze a bit more at the lower end here - another 16
    orders of magnitude - at the cost of rapidly decreasing precision. They
    don't stop the inevitable approximation to zero, they just delay it a
    little.

    I am still at a loss to understand how this is going to be useful - when
    will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially
    when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and there
    you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even there, denormals are not going to give you more than a tiny amount extra.

    (There are, of course, mathematical problems which deal with values or precisions far outside anything of relevance to the physical world, but
    if you are dealing with those kinds of tasks then IEEE floating point is
    not going to do the job anyway.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Oct 15 12:54:30 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    David Brown <[email protected]> posted:

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It’s about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.
    That was true under 754-2008 but we fixed it for 2019: All NaNs
    propagate through the new min/max definitions. The old still exist of
    course, but they are deprecated.
    The point that made it obvious to everyone was that under the 2008
    definition an SNaN would always propagate, but be converted to a QNaN,
    but a QNaN could silently disappear as shown above.
    What this meant was that for any kind of vector reduction, the final
    result could be the NaN or any of the other input values, depending upon the order of the individual comparisons!
    I was one of the proponents who pushed this change through, but I will
    say that after we showed some of the most surprising results, everyone
    agreed to fix it. Having NaN maximally sticky is also definitely in the
    spirit of the entire 754 standard:
    The only operations that do not propagate NaN are those that explicitly
    handle this case, or those that don't return a floating point value.
    Having all compares return 'false' is an example of the latter.
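    (Editorial sketch, not from Terje's post: C's fmax() implements the
    754-2008 maxNum rule described above, so a quiet NaN silently disappears;
    newer C revisions add a separate fmaximum family for the 2019 semantics.)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 2.5, q = NAN;

        /* maxNum treats a quiet NaN as "missing data": the number wins. */
        printf("fmax(x, NaN) = %g\n", fmax(x, q));   /* 2.5 */
        printf("fmax(NaN, x) = %g\n", fmax(q, x));   /* 2.5 */
        return 0;
    }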
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Oct 15 13:07:01 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration.  Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y.  Then - again in the mathematical real domain - the operation is carried out.  Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent.  For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude.  (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16 orders of magnitude - at the cost of rapidly decreasing precision.  They don't stop the inevitable approximation to zero, they just delay it a little.

    I am still at a loss to understand how this is going to be useful - when will that small extra margin near zero actually make a difference, in
    the real world, with real values?  When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially when these smaller numbers have lower precision?
    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.
    I.e. they differ by exactly one ulp.
    As I noted, I have not been bitten by this particular issue, one of the
    reasons being that I tend to not write infinite loops inside functions,
    instead I'll pre-calculate how many (typically NR) iterations should be needed.
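    (Editorial sketch of that style; the reciprocal example, the helper name
    nr_recip and the seed-accuracy numbers are illustrative assumptions, not
    Terje's code.  Newton-Raphson roughly doubles the number of correct bits
    per step, so the iteration count can be fixed up front from the accuracy
    of the initial guess.)

    #include <stdio.h>

    /* Reciprocal via Newton-Raphson with a fixed iteration count.  Even an
       8-bit seed needs only 3 doublings (8 -> 16 -> 32 -> 64 > 53 bits);
       the single-precision division below is just a convenient seed. */
    static double nr_recip(double d)
    {
        double x = 1.0f / (float)d;       /* ~24-bit seed */
        for (int i = 0; i < 3; i++)
            x = x * (2.0 - d * x);        /* classic NR step for 1/d */
        return x;
    }

    int main(void)
    {
        printf("%.17g\n", nr_recip(3.0));   /* ~0.33333333333333331 */
        return 0;
    }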
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 16:50:13 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <[email protected]> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no
    change of sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algo can be called
    half-Newton.








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 17:46:21 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 13:07:01 +0200
    Terje Mathisen <[email protected]> wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough
    to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you,
    Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison,
    the size of the universe measured in Planck lengths is only about
    61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here -
    another 16 orders of magnitude - at the cost of rapidly decreasing
    precision. They don't stop the inevitable approximation to zero,
    they just delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are
    using your Newton-Raphson iteration to find your function's zeros,
    what are the circumstances in which you can get a more useful end
    result if you continue to 10 ^ -324 instead of treating 10 ^ -308
    as zero - especially when these smaller numbers have lower
    precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least
    some zero-seeking algorithms will stabilize on an exact value, if and
    only if you have subnormals, otherwise it is possible to wobble back
    & forth between two neighboring results.

    I.e. they differ by exactly one ulp.

    As I noted, I have not been bitten by this particular issue, one of
    the reasons being that I tend to not write infinite loops inside
    functions, instead I'll pre-calculate how many (typically NR)
    iterations should be needed.

    Terje
    It does not sound right to me. With Newton-like iterations, oscillations
    by 1 ULP could happen even with subnormals. They should be taken care of
    by properly written exit conditions.
    What could happen without subnormals are oscillations by *more* than 1
    ULP, sometimes much more.
    Also, in the absence of subnormals one can suffer divisions by zero in
    code like below:

    while (fb > fa) {
        a -= b*fa/(fb - fa);
        ...
    }
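    (Editorial sketch, x86-specific and not from Michael's post: flipping the
    SSE flush-to-zero / denormals-are-zero bits reproduces exactly this
    failure mode -- fb > fa still holds while fb - fa compares equal to zero,
    so the division above blows up.  Assumes SSE floating point, the default
    on x86-64.)

    #include <float.h>
    #include <math.h>
    #include <stdio.h>
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    int main(void)
    {
        volatile double fb = nextafter(DBL_MIN, 1.0);  /* tiny normal */
        volatile double fa = DBL_MIN;                  /* one ulp smaller */

        printf("subnormals on : fb > fa = %d, fb - fa = %g\n",
               fb > fa, fb - fa);                      /* 1, ~4.9e-324 */

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        printf("FTZ/DAZ on    : fb > fa = %d, fb - fa = %g\n",
               fb > fa, fb - fa);                      /* 1, 0 */
        return 0;
    }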
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 16:53:33 2025
    From Newsgroup: comp.arch

    On 15/10/2025 13:07, Terje Mathisen wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration.  Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.  Then
    - again in the mathematical real domain - the operation is carried
    out.  Then the result is truncated or rounded to fit back within the
    mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent.  For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just delay
    it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values?  When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I.e. they differ by exactly one ulp.

    I have no problems believing that this can occur on occasion. No matter
    what range you pick for your floating point formats, or what precision
    you pick, you will always be able to find examples of this kind of
    algorithm that home in on the right value with the format you have
    chosen but would fail with just one bit less. I just don't think that
    such pathological examples mean that subnormals are important.

    But if such cases occur regularly in real-world calculations, not just artificial examples, then it's a different matter.


    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions, instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 17:52:48 2025
    From Newsgroup: comp.arch

    On 15/10/2025 15:50, Michael S wrote:
    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <[email protected]> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y, then x - y > 0.
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.
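
    A minimal C illustration of that invariant right at the bottom of the
    normal range (nextafter() is standard <math.h>; with gradual underflow
    the difference is the subnormal DBL_TRUE_MIN, whereas a flush-to-zero
    implementation would deliver 0 even though x > y):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        double y = DBL_MIN;            /* smallest positive normal double */
        double x = nextafter(y, 1.0);  /* next representable value up     */

        printf("x > y     : %d\n", x > y);          /* 1                  */
        printf("x - y     : %g\n", x - y);          /* subnormal, not 0   */
        printf("x - y > 0 : %d\n", (x - y) > 0.0);  /* 1 with subnormals  */
        return 0;
    }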


    I can appreciate that you can have x > y, but with such small x and y
    and such close values that (x - y) is a subnormal - thus without
    subnormals, (x - y) would be 0.

    Perhaps I am being obtuse, but I don't see how you would write a Newton-Raphson algorithm that would fail to converge, or fail to stop,
    just because you don't have subnormals. Could you give a very rough
    outline of such problematic code?


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no
    change of sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algorithm could be
    called half-Newton.


    I was perhaps that age when I first came across Newton-Raphson in a
    maths book, and wrote an implementation for it on a computer. That was
    in BBC Basic, and I'm pretty sure that the floating point type there was
    not IEEE compatible, and did not support such fancy stuff as subnormals!
    But I am also very sure I did not push the program to more difficult examples. (But it did show nice graphic illustrations of what it was
    doing.)

    It was also around then that I wrote a program for matrix inversion, and discovered the joys of numeric instability, and thus the need for care
    when picking the order for Gaussian elimination.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Wed Oct 15 13:22:01 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 15 Oct 2025 05:55:40 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.
    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.
    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the
    difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).
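
    The same control-to-data-dependency idea can also be expressed purely in
    software; a hedged C sketch of generic index masking (the helper name is
    illustrative, not any particular library's API):

    #include <stddef.h>

    /* Returns idx when idx < limit, 0 otherwise, using arithmetic rather
       than a branch.  The load below therefore has a *data* dependency on
       the comparison, so a mispredicted bounds check cannot steer it to an
       attacker-chosen address.  (A sufficiently clever compiler could turn
       this back into a branch, so real deployments pin it down with inline
       asm or a compiler barrier.) */
    static inline size_t mask_index(size_t idx, size_t limit)
    {
        size_t in_bounds = (size_t)(idx < limit);   /* 1 or 0   */
        return idx & (0 - in_bounds);               /* idx or 0 */
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        if (idx >= limit)
            return -1;                       /* architectural check    */
        return base[mask_index(idx, limit)]; /* speculation-safe index */
    }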

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:09:27 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    ----------------------------

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost?

    Most people would say:: "When it adds performance" AND the compiler
    can use it. Some would add: "from unmodified source code"; but I
    am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Printf-family "closes more of the gap" than EDIT ever could. And there
    is a whole suite of things better off left in subroutines than being
    raised into Instructions.

    Unfortunately, elementary FP functions are no longer in that category.
    When one can perform SIN(x) along with argument reduction and polynomial calculation in the cycle time of FDIV, SIN() deserves to be a first
    class member of the instruction set--especially if the HW cost is
    "not that much".

    On the other hand: things like polynomial evaluating instructions
    seem a bridge too far as you have to pick for all time 1 of {Horner,
    Estrin, Padé, Power Series, Clenshaw, ...} and at some point it
    becomes better to start using FFT-derived evaluation means.
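
    For reference, Horner's scheme (the first item in that list) is the
    simplest of those evaluation orders - one multiply-add per coefficient,
    processed from the highest degree down:

    /* Evaluate c[0] + c[1]*x + ... + c[n]*x^n using Horner's scheme. */
    double horner(const double *c, int n, double x)
    {
        double acc = c[n];
        for (int i = n - 1; i >= 0; i--)
            acc = acc * x + c[i];   /* maps directly onto FMA */
        return acc;
    }

    Estrin's scheme instead evaluates independent sub-polynomials in
    parallel to shorten the dependency chain, at the cost of a few extra
    multiplies - exactly the kind of choice such an instruction would be
    freezing for all time.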

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Arguably, the best thing to do here is to Trap on the creation of deNorms.
    At least then you can see them and do something about them at the algorithm level. {Gee Whiz Cap. Obvious: IEEE 754 already did this!}
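
    As an aside, a glibc/x86-specific sketch of asking for exactly that
    today: unmasking the IEEE underflow exception makes the creation of a
    tiny result trap (feenableexcept() is a glibc extension, and the trap
    arrives as SIGFPE):

    #define _GNU_SOURCE
    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        feenableexcept(FE_UNDERFLOW);  /* trap on underflow / deNorm creation */

        volatile double x = DBL_MIN;
        volatile double y = x / 4.0;   /* subnormal result -> SIGFPE here     */
        printf("%g\n", y);             /* not reached when trapping works     */
        return 0;
    }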

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    My 66000 is immune from Spectré; µA state is not updated until retire.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    We just don't have the smoking gun of a missing $1M-to-$1B to make it
    worth the effort to do something about it. But mark my words:: the vulnerability is being exploited ...

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:13:53 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    -------------------------------
    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    My 66000 allows an application to crap all over "the stack";
    but it does provide a means whereby "crapping all over the stack"
    does not allow the application to violate the contract between caller
    and callee. Once application performs a RET (or EXIT) control is returns
    to caller 1 instruction past calling point, and with the preserved
    registers preserved !

    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:28:52 2025
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:34:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more
    accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions, instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Almost always the right course of events.

    The W() function may be different. W( poly×(e^poly) ) = poly.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 21:37:42 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the “R” in “RISC”.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 21:42:32 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what “very-high-level” means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we’re seeing here is a downward creep, as those very-high-level languages (Python and JavaScript, particularly) are encroaching into the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels, otherwise we might as well use the latter.

    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    I’m not aware of such; feel free to give an example of some large Python project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 22:19:18 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    Yes, 28 YEARS after it was first put in !! it danged better be
    able !?! {yes argue about when}

    My point was that you don't put it in until you can see a performance
    advantage in the very next (or internal) compiler. {Where 'you' are
    the designers of that generation.}

    But the RISC-V folks still think Cray-style long vectors are better than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors
    (or vice versa)--they simply represent different ways of shooting
    yourself in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 22:31:32 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what “very-high-level” means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we’re seeing here is a downward creep, as those very-high-level languages (Python and JavaScript, particularly) are encroaching into the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels, otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL
    is acclaimed by the masses--it instantly falls into the trap where
    "users want more performance":: something the VHLL cannot provide
    until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote
    it in a high-performance language (FORTRAN or C) so it was usably fast.

    History has a way of repeating itself, when no-one remembers the past.

    In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning
    the whole kit and caboodle for tablets and phones, which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    I’m not aware of such; feel free to give an example of some large Python project, for example, which has exceeded its time and/or budget. The key point about using such a very-high-level language is you can do a lot in just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Thu Oct 16 05:44:04 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Thu Oct 16 05:57:34 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup wrote:

    On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D’Oliveiro wrote:

    What we’re seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into
    the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
    want more performance":: something the VHLL cannot provide until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote it
    in a high-performance language (FORTRAN or C) so it was usably fast.

    No, you didn’t. There is a Pareto rule in effect, in that the majority of the CPU time (say, 90%) is spent in a minority of the code (say, 10%). So having got your prototype working, and done suitable profiling to identify
    the bottlenecks, you concentrate on optimizing those bottlenecks, not on rewriting the whole app.

    Paul Graham (well-known LISP guru) described how the company he was with
    -- one of the early Dotcom startups -- wrote Orbitz, an airline
    reservation system, in LISP. But the most performance critical part was
    done in C++.

    Nowadays, with the popularity of Python, we already have lots of efficient lower-level toolkits to take care of common tasks, taking advantage of the versatility of the core Python language. For example, NumPy for handling serious number-crunching: you write a few lines of Python, to express a high-level operation that crunches a million sets of numbers in just a few seconds.

    Maybe it only took you a minute to come up with the line of code; maybe
    you will never need to run it again. Writing a program entirely in FORTRAN
    or C to perform the same operation might take an expert programmer an hour
    or two, say; in that time, the Python programmer could try out dozens of similar operations, maybe discard the results of three quarters of them,
    to narrow down the important information to be extracted from the raw
    data.

    That’s the kind of productivity gain we enjoy nowadays, on a routine
    basis, without making a big deal about it in news headlines. And that’s
    why we don’t talk about a “software crisis” any more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Thu Oct 16 09:04:23 2025
    From Newsgroup: comp.arch

    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are. I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in parallel in superscalar CPUs, rather than Itanium EPIC coding.
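
    A rough C-level sketch of why that is width-agnostic: the strip-mining
    loop below is what something like RISC-V's vsetvl does for you, with the
    hardware choosing the strip length. VLMAX here is purely illustrative:

    #include <stddef.h>

    #define VLMAX 64   /* stand-in for whatever the implementation provides */

    /* c[i] = a[i] + b[i]; the inner loop stands in for one vector add of
       "vl" elements, so the source does not change when the hardware gets
       wider - only VLMAX does. */
    void vadd(int *c, const int *a, const int *b, size_t n)
    {
        for (size_t i = 0; i < n; ) {
            size_t vl = n - i;
            if (vl > VLMAX)
                vl = VLMAX;
            for (size_t j = 0; j < vl; j++)
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }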


    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Oct 16 07:00:58 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. NetSpectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre
    "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations
    disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Thu Oct 16 11:34:20 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
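
    A hedged C sketch of both points (rsqrt as the one building block, and a
    pre-calculated iteration count instead of an open-ended loop); the crude
    frexp()-based seed and the count of 8 iterations are my own choices:

    #include <math.h>
    #include <stdio.h>

    /* 1/sqrt(x) for finite x > 0 via Newton-Raphson on y ~ 1/sqrt(x):
       y <- y * (1.5 - 0.5*x*y*y).  The seed only gets the exponent roughly
       right, so 8 iterations (quadratic convergence) are ample for double. */
    static double my_rsqrt(double x)
    {
        int e;
        (void)frexp(x, &e);              /* x = m * 2^e, 0.5 <= m < 1 */
        double y = ldexp(1.0, -e / 2);   /* crude seed ~ 2^(-e/2)     */
        for (int i = 0; i < 8; i++)
            y = y * (1.5 - 0.5 * x * y * y);
        return y;
    }

    int main(void)
    {
        printf("my_rsqrt(2) = %.17g\n", my_rsqrt(2.0));
        printf("1/sqrt(2)   = %.17g\n", 1.0 / sqrt(2.0));
        /* sqrt(x) = x * rsqrt(x) and 1/x = rsqrt(x)*rsqrt(x), so the same
           helper also covers sqrt and reciprocal/fdiv-style uses. */
        return 0;
    }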

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Oct 16 10:24:37 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results
    this *design approach* should achieve.

    "Several factors indicate a Reduced Instruction Set Computer as a
    reasonable design alternative.
    ...
    Implementation Feasibility. A great deal depends on being able to fit
    an entire CPU design on a single chip.
    ...
    [EricP: reduced absolute amount of logic for a minimum implementation]

    Design Time. Design difficulty is a crucial factor in the success of
    VLSI computer.
    ...
    [EricP: reduced complexity leading to reduced design time]

    Speed. The ultimate test for cost-effectiveness is the speed at which an implementation executes a given algorithm. Better use of chip area and availability of newer technology through reduced debugging time contribute
    to the speed of the chip. A RISC potentially gains in speed merely from a simpler design.
    ...
    [EricP: reduced complexity and logic leads to reduced critical
    path lengths giving increased frequency.]

    Better use of chip area. If you have the area, why not implement the CISC?
    For a given chip area there are many tradeoffs for what can be realized.
    We feel that the area gained back by designing a RISC architecture rather
    than a CISC architecture can be used to make the RISC even more attractive
    than the CISC. ... When the CISC becomes realizable on a single chip,
    the RISC will have the silicon area to use pipelining techniques;
    when the CISC gets pipelining the RISC will have on chip caches, etc.
    ...
    [EricP: reduced waste on dragging around architectural boat anchors]

    The experience we have from compilers suggests that the burden on compiler writers is eased when the instruction set is simple and uniform.
    ...
    [EricP: reduced compiler complexity and development work]
    "

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Oct 16 10:32:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <[email protected]> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than of
    Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
    not call them spinless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by
    changing it to a data flow dependency, may be applicable in other areas.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.
    Then I won't have to spend time cleaning my shoes afterwards.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Oct 16 23:04:44 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 07:00:58 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. NetSpectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.


    I don't think that was the primary mitigation of Spectre Variant 1
    implemented in browsers.
    Indeed, they made the clock less precise, but that was their secondary
    line of defense, mostly aimed at new SPECTRE variants that have not
    been discovered yet.
    For Spectre Variant 1 they implemented a much more direct defense.
    For example, before mitigation the JS statement val = x[i] was compiled to:
    cmp %RAX, 0(%RDX)  # compare i with x.limit
    jbe oob_handler
    mov 8(%RDX, %RAX, 4), %RCX
    After mitigation it looks like:
    xor %ECX, %ECX
    cmp %RAX, 0(%RDX)  # compare i with x.limit
    jbe oob_handler
    cmovbe %ECX, %EAX  # data dependency zeroes the index, preventing problematic speculation
    mov 8(%RDX, %RAX, 4), %RCX

    Almost identical code could be generated on ARM or POWER or SPARC. On
    MIPS rev6 it could be even shorter. On non-extended RISC-V it would be
    somewhat longer, but browser vendors do not care about RISC-V, extended
    or not.

    The part above was written for the benefit of interested bystanders.
    You already know all that.


    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton

    Maybe I'll look at it some day. Certainly not tonight.
    Maybe never.
    After all, neither you nor I are experts in the design of modern high-perf
    CPUs. So our reasonings about the performance impact of this or that HW
    solution are at best educated hand-waving.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Oct 16 15:17:22 2025
    From Newsgroup: comp.arch

    On 10/16/2025 12:44 AM, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.


    With some fighting as to what exactly it means:
    Small Listing (or smallest viable listing);
    Simple Instructions (Eg: Load/Store);
    Fixed-size instructions;
    ...

    So, for RISC-V:
    First point only really holds in the case of RV64I.
    For RV64G, there is already a lot of unnecessary stuff in there.
    Second Point:
    Fails with the 'A' extension;
    Also parts of F/D.
    Third Point:
    Fails with RV-C.
    Though, people redefine it:
    Still RISC so long as not using an x86-style encoding scheme.

    Well, and there is still the past example of some old marketing for the
    MSP430 trying to pass it off as a RISC, where it had more in common with
    the PDP-11 than with any of the RISCs (and the only reason the listing
    looks tiny is that it ignores the special cases encoded in certain
    combinations of registers and addressing modes).

    Like, you can sweep things like immediate-form instructions when you can
    do "@PC+" and get the same effect.


    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)


    RISC-V tends to fail at this one in some areas...

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    The P extension is also a fail in this area, as they went whole-hog in defining new instructions for nearly every possible combination.



    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.


    IME, SIMD tends to primarily show benefits with 2 and 4 element vectors.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Also, element sizes:
    Most of the dominant use-cases seem to involve 16 and 32 bit elements.
    Most cases that involve 8 bit elements are less suited to actual
    computation at 8 bits (for example, RGB math often works better at 16 bits).


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit spot.
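
    A small C sketch of that trick (my own illustration, not BGB's actual
    code): replicating the 8-bit channel into both bytes of a 16-bit lane is
    a multiply by 257, so 0x00..0xFF maps onto 0x0000..0xFFFF and a plain
    shift-by-8 narrowing round-trips exactly:

    #include <stdint.h>

    static inline uint16_t widen8(uint8_t v)   /* 0xAB -> 0xABAB (v*257)   */
    {
        return (uint16_t)((v << 8) | v);
    }

    static inline uint8_t narrow16(uint16_t v) /* high byte: exact inverse */
    {
        return (uint8_t)(v >> 8);
    }

    /* Example: 50/50 blend of two channels in the widened domain. */
    static inline uint8_t blend50(uint8_t a, uint8_t b)
    {
        return narrow16((uint16_t)((widen8(a) + widen8(b)) >> 1));
    }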

    For various tasks, it might have been better to have gone with an
    unpack/repack scheme like:
    Pad2.Value8.Frac6
    Pad4.Value8.Frac4
    Where Pad can deal with values outside unit range, and Frac with values between the two LDR points. Then the RGB narrowing conversion operations
    could have had the option for round-and-saturate.

    Though, a more tacky option is to use the existing unpack operation and
    then invert the low-order bits to add a little bit of padding space for underflow/overflow.

    Another option being to use "Packed Shift" instructions to get a format
    with pad bits.


    No saturating ops in my case, as saturating ops didn't seem worth it
    (and having Wrap/SSat/USat/... is a big part of the combinatorial
    explosion seen in the P extension).



    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.


    Checking, if I take XG3, and exclude SIMD, 128-bit integer instructions,
    stuff for 96-bit addressing, etc, the listing drops to around 208 instructions.

    This does still include things like instructions with niche addressing
    modes (such as "(GP,Disp16)"), etc.

    If stripped back to "core instructions" (excluding rarely-used
    instructions, such as ROT*/etc, and some of these alternate-mode
    instructions, etc), could be dropped back a little further.

    There are some instructions in the listing that would have been merged
    in RISC-V, like FPU instructions which differ only in rounding mode (the
    RNE and DYN instructions exist as separate instructions in this case, ...).


    It is a little over 400 if the SIMD and ALUX stuff and similar is added
    back in (excluding things like placeholder spots, or instructions which
    were copied from XG2 but are either N/A or redundant, ...).

    There is a fair chunk of instructions which mostly exist as SIMD format converters and similar.


    So, seems roughly:
    ~ 50%: Base instructions
    ~ 20%: ALUX and 96-bit addressing.
    ~ 30%: SIMD stuff

    Internally to the CPU core, there are roughly 44 core operations ATM,
    though many multiplex groups of related operations as sub-operations.

    So, things like ALU/CONV/etc don't represent a single instruction.
    But, JMP/JSR/BRA/BSR are singular operations (and BRA/BSR both map to
    JAL on the RV side, differing as to whether Rd is X0 or X1; similarly
    with both JMP and JSR mapping to JALR in a similar way).

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).


    Other option being to trap and (potentially) emulate, if Rd is not X0 or
    X1 (or just ignore it). Also, very possible, is demoting basically the
    entire RV 'A' extension to "trap and emulate".

    So, in HW:
    RV64I : Fully
    M : Mostly
    A : Trap/Emulate
    F/D : Partial (many cases are traps)
    Zicsr : Partial (trap in general case)
    Zifencei: Trap
    ...


    where, say, ALU gets a 6-bit control value:
    (3:0): Which basic operation to perform;
    (5:4): In one of several ways:
    00: 32-bit, sign-ext result (eg: ADDW in RV terms)
    01: 32-bit, zero-ext result (eg: ADDWU in RV terms)
    10: 64-bit (ADD)
    11: 2x 32-bit for some ops (e.g. PADD.L) or 4x 16-bit for others
    (e.g. PADD.W).
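
    A minimal C sketch of decoding that 6-bit control value (field layout
    as described above; the names are placeholders, not the actual core's):

    #include <stdint.h>

    enum alu_width {
        W32_SX = 0,  /* 00: 32-bit, sign-extended result (ADDW-like)  */
        W32_ZX = 1,  /* 01: 32-bit, zero-extended result (ADDWU-like) */
        W64    = 2,  /* 10: 64-bit (ADD)                              */
        WPACK  = 3   /* 11: packed 2x32 or 4x16, depending on the op  */
    };

    static void decode_alu_ctl(uint8_t ctl, int *op, enum alu_width *w)
    {
        *op = ctl & 0x0F;                          /* bits 3:0: basic op   */
        *w  = (enum alu_width)((ctl >> 4) & 0x3);  /* bits 5:4: width/mode */
    }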

    There is CONV/CONV2/CONV3:
    CONV: Simple 2R converter ops which may have 1-cycle latency
    (later demoted to 2-cycle, with MOV being relocated elsewhere).
    CONV2: More complex 2R converter ops, 2 cycle latency.
    CONV3: Same as CONV2, but because CONV2 ran out of space.


    Still no real mechanism to deal with the potential proliferation of
    ".UW" instructions in RISC-V; for now I have been ignoring this.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Oct 16 16:26:27 2025
    From Newsgroup: comp.arch

    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware.  With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.  With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.  I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind.  It is akin to letting the processor hardware handle multiple instructions in parallel in superscalar cpus, rather than Itanium EPIC coding.


    But, there is a problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
    Vector Masking;
    Resource and energy costs of using wider vectors;
    ...

    Then, for 'V':
    In the basic case, it effectively doubles the size of the register file
    vs 'G';
    ...


    Then we have x86 land:
    SSE: Did well;
    AVX256: Rocky start, negligible benefit from the YMM registers;
    Using AVX encodings for 128-bit vectors being arguably better.
    AVX512: Sorta exists, but:
    Very often not supported;
    Trying to use it (on supported hardware) often makes stuff slower.

    If even Intel can't make their crap work well, I am skeptical.

    While arguably GPUs were very wide, it is different:
    They were often doing very specialized tasks (such as 3D rendering);
    And, often with a SIMT model rather than "very large SIMD";
    Things like CUDA (and RTX) actually push things narrower;
    Larger numbers of narrower cores,
    rather than smaller number of wider cores.
    ...


    The one area that doesn't seem to run into a diminishing returns wall
    seems to be to map "embarrassingly parallel" problems to large numbers
    of processor cores, and to try to keep things as loosely coupled as
    possible.

    This works mostly until the CPU runs out of memory bandwidth or similar.



    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.


    Agreed, this is more the stance I take.

    Instructions should be simple for the hardware and should allow for low latency, rather than trying to make the instruction listing small.



    Though, that said, I still did end up in my case making most
    instructions have a 2 or 3 cycle latency.

    So, generally, MOV-RR and MOV-IR end up as basically the only single-cycle instructions. A case could almost be made for making *all* instructions 2 or 3 cycles and then eliminating forwarding from EX1
    entirely (or maybe adding an EX4 stage).

    Say:
    PF IF ID RF E1 E2 E3 WB
    FW from E2 and E3
    RAW hazard between RF and E1 always stalls.
    Or:
    PF IF ID RF E1 E2 E3 E4 WB
    FW from E2, E3, and E4.

    With an E4 stage, one could maybe allow for pipelined low-precision FMAC
    or similar.


    Though, I see it more as the ISA not actively hindering achieving >= 1
    IPC throughput, rather than instructions having 1 cycle latency.

    But, can note that having 2 cycle latency does hinder the efficiency of
    some common patterns in RISC-V, where tight register RAW dependencies
    run rampant.

    So, say, you ideally want 5-8 instructions between each instruction and
    the next instruction that uses the result. This typically does not
    happen in most code, and particularly not if one needs instruction
    chains for semi-common idioms (say, where the optimal instruction
    scheduling would far exceed the length of a typical loop body).

    For better or worse, this does tend to result in a lot of
    performance-sensitive code being written to use fairly heavy-handed loop
    unrolling though (as in the sketch below).
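
    A small C sketch of the kind of unrolling meant here: several
    independent accumulators, so that no add consumes a result produced on
    the immediately preceding cycle (assumes a 2-3 cycle latency, as above):

    #include <stddef.h>

    float sum4(const float *a, size_t n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i + 0];   /* four independent dependency chains */
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)    /* scalar tail */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }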

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 21:52:22 2025
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:

    MitchAlsup wrote:
    EricP <[email protected]> posted: ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than of
    Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.

    This adds unnecessary execution latency to the architectural path.
    Without the check you have <say> 3-cycle unchecked LD
    With the check you have 4-cycle checked LD

    Now get some multi-pointer chasing per iteration algorithm in a loop and
    all of a sudden the execution window is no longer big enough to run it at
    full speed.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.

    Just making sure you remain aware of the cow-pies littering the field...

    Then I won't have to spend time cleaning my shoes afterwards.

    I am more worried about the blood on the shoes than the cow-pie.
    {{shooting oneself in the foot}}
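
    As an aside, a rough C-level analogue of the "turn the bounds check into
    a data dependency" idea above (a sketch only; this mask trick is a
    software variant, not the CHKcc instruction itself, and a compiler may
    still turn the comparison back into a branch unless written in asm):

    #include <stddef.h>

    /* Returns idx if idx < limit, else 0, without a conditional branch;
       the load address below is then data-dependent on the comparison. */
    static inline size_t clamp_index(size_t idx, size_t limit)
    {
        size_t mask = (size_t)0 - (size_t)(idx < limit);  /* all-ones or 0 */
        return idx & mask;
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        return base[clamp_index(idx, limit)];
    }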
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 21:59:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.

    Terje
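
    A tiny C sketch of the vector-normalization use mentioned above (here
    rsqrt() is just 1/sqrt(); a real implementation would use a HW estimate
    refined by a Newton-Raphson step or two):

    #include <math.h>

    static inline double rsqrt(double x) { return 1.0 / sqrt(x); }

    /* Normalize a 3-vector with one rsqrt and three multiplies,
       instead of one sqrt plus three divides. */
    static void normalize3(double v[3])
    {
        double r = rsqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        v[0] *= r; v[1] *= r; v[2] *= r;
    }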

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 22:19:21 2025
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up the instruction opcode space with a combinatorial explosion. (Or sequence of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.

    Among SIMD's ISA problems is additional state at context switch time
    on top of FP's added state at context switch time; but with all the
    fast memory move subroutines being SIMD-based--the service routines
    need access to SIMD that they don't normally need for FP {and the
    SIMD register file is larger, too}

    With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.

    Vector LD and ST instructions are not conceptually different than
    LDM and STM--1 instruction accesses multiple memory locations.

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve
    many memory aliasing issues to use the vector ISA.

    Software writes vector loops--yet the HW vectorizes instructions.

    {{I might note My 66000 vectorizes loops not instructions to avoid
    this problem; For example::

    for( i = 0; i < max; i++ )
    {
        temp = a[i];
        a[i] = a[max-i];
        a[max-i] = temp;
    }

    is vectorizable in My 66000--those loops where the memory references
    do not overlap can run "as fast as the width of the data path allow"
    while those with memory reference collisions run no worse than scalar
    code. For a large value of max the profile would look like::

    FFFFFFFFFFFFFFFFFsssFFFFFFFFFFFFFFFFF

    F representing fast (say 4-wide or 8-wide)
    s representing slow (say 1-wide)

    The same binary runs as fast as memory references (and data-flow
    dependencies and data-path width) allow.
    }}

    I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in parallel in superscalar cpus, rather than Itanium EPIC coding.


    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    CRAY-like vector computers built memory systems that could handle the load
    of the vector calculations. CRAY-1 could perform a new memory access every clock, CRAY-[XY]MP could handle 2 LDs and 1 ST per clock continuously.

    If those CPUs of today were really going to fully utilize the vector
    data-path, they are going to have to have a lot better memory system
    than they are building presently (1 new cache miss per cycle).

    The power of the vector computers was almost entirely in the memory system
    not in the data path (which is surprisingly easy to build, and surprisingly difficult to keep fed).

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    On vacation over the summer, I coined a new phrase to denote what I
    hope My 66000 will end up being::

    CARD Computer Architecture Rightly Done.

    Note: It does not stop at ISA--as ISA is less than 1/3rd of what a
    computer architecture is and means.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Thu Oct 16 23:13:58 2025
    From Newsgroup: comp.arch



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning the
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    Why take a chance? There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and unfinished, at least someone is working on it.


    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:48:27 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:51:18 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive the (otherwise universal) transition
    to RISC was kept afloat through high revenues and high margins, which
    allowed the company to spend the much higher sums needed to add all the
    extra millions of transistors necessary to keep performance competitive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:53:16 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 07:03:16 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common code
    to execute the actual loop bodies.

    Most use-cases for longer vectors tend to matrix-like rather than vector-like. Or, what cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful results
    out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit
    spot.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet
    printers had 5 or 6 different colour inks, to try to fill out more of the
    CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision
    might not be enough, so they allow for double precision as well.

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly pretty much never used in practice (so not really worth the logic cost).

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to
    address in CTR and leave return address in LR).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Oct 17 13:54:50 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive
    There are two of them..
    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep performance competitive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Oct 17 13:59:33 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David
    Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this
    *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common
    was the larger register sets.
    Larger register sets were common, but not universal.
    Load/store architecture (with allowance for exceptions for
    synchronization primitives that are not expected to be as fast as
    normal instructions) appears to be universal.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Fri Oct 17 14:31:46 2025
    From Newsgroup: comp.arch

    On 16/10/2025 23:26, BGB wrote:
    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single
    primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid
    filling up
    the instruction opcode space with a combinatorial explosion. (Or
    sequence
    of combinatorial explosions, when you look at the wave after wave of
    SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on
    different hardware.  With SIMD, you need different code if your
    processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
    different instructions using different SIMD registers.  With the
    vector style instructions in RISC-V, the actual SIMD registers and
    implementation are not exposed to the ISA and you have the same code
    no matter how wide the actual execution units are.  I have no
    experience with this (or much experience with SIMD), but that seems
    like a big win to my mind.  It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar cpus,
    rather than Itanium EPIC coding.


    But, there is problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
      Vector Masking;
      Resource and energy costs of using wider vectors;
      ...


    I appreciate that. Often you will either be wanting the operations to
    be done on a small number of elements, or you will want to do it for a
    large block of N elements which may be determined at run-time. There
    are some algorithms, such as in cryptography, where you have sizeable but fixed-size blocks.

    When you are dealing with small, fixed-size vectors, x86-style SIMD can
    be fine - you can treat your four-element vectors as single objects to
    be loaded, passed around, and operated on. But when you have a large
    run-time count N, it gets a lot more inefficient. First you have to
    decide what SIMD extensions you are going to require from the target,
    and thus how wide your SIMD instructions will be - say, M elements.
    Then you need to loop N / M times, doing M elements at a time. Then you
    need to handle the remaining N % M elements - possibly using smaller
    SIMD operations, possibly doing them with serial instructions (noting
    that there might be different details in the implementation of SIMD and
    serial instructions, especially for floating point).
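
    A bare-bones sketch of that structure (plain C stands in for the
    intrinsics; M is an assumed SIMD width, and the inner fixed-count loop
    is what the M-wide instructions would replace):

    #include <stddef.h>

    #define M 8   /* assumed SIMD width, in elements */

    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + M <= n; i += M)           /* N / M full chunks */
            for (size_t j = 0; j < M; j++)
                dst[i + j] = a[i + j] + b[i + j];
        for (; i < n; i++)                   /* N % M remainder, done serially */
            dst[i] = a[i] + b[i];
    }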

    The resulting code is big, ugly, tuned to specific targets (it will be
    slower than optimal if run on a target with wider SIMD, and won't run at
    all on a target with narrower SIMD), and have huge overhead if it
    happens to be run with a small N. Oh, and it might not work - or work
    less efficiently - if the data alignments are not ideal.

    Vector processing avoids pretty much all of those disadvantages.

    Just try writing a loop function in godbolt.org, and compile it with x86
    clang or gcc -O3 -march=rocketlake, and compare the results to compiling
    it for risc-v with -march=rv64gv.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Fri Oct 17 14:38:08 2025
    From Newsgroup: comp.arch

    On 17/10/2025 08:48, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many
    memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    "restrict" can significantly improve non-vectored code too, as well as
    more "ad-hoc" vectoring of code where the compiler uses general-purpose registers, but interlaces loads, stores and operations to improve
    pipelining. But it is certainly a very useful qualifier for vector code.
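
    For example (a minimal sketch): with "restrict" the compiler may assume
    the three pointers never overlap, so it is free to vectorize and to
    interleave the loads and stores:

    #include <stddef.h>

    void vadd(float *restrict dst,
              const float *restrict a,
              const float *restrict b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }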


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 13:00:48 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
    x86 / x86-64: Alive and well in PCs.
    6502: Now dead (no more 6502's being made)
    65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
    M68K: Mostly Dead
    NXP ColdFire: Still lives (Simplified M68K).
    MSP430: Still Lives (I classify it as a CISC).
    IBM S/360: Dead on real HW
    Lives on in emulation.



    In looking around, I noted that apparently my VUGID/ACLID idea isn't
    entirely novel. Apparently something similar existed in S/360 and IA-64
    under the name of "Protection Keys".

    Then again, the origin of this idea in my case was basically "borrowed"
    from the "Tron 2.0" game, which presented a similar idea in the game (to justify why doors could be locked, a normal game mechanic), and I was
    left thinking "Why not?..."

    Well, apparently real HW did do this, just not x86 or ARM or similar...


    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep
    performance competitive.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 11:49:03 2025
    From Newsgroup: comp.arch

    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
      x86 / x86-64: Alive and well in PCs.
      6502: Now dead (no more 6502's being made)
        65C816: Still holding on (niche), backwards compatible with 6502.
      Z80: Dead
      M68K: Mostly Dead
        NXP ColdFire: Still lives (Simplified M68K).
      MSP430: Still Lives (I classify it as a CISC).
      IBM S/360: Dead on real HW
        Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the S/360
    are alive in real hardware. While I expect there haven't been any "new
    name" customers in a long time, the fact that IBM still introduces new
    chips every few years indicates that there is still a market for this architecture, presumably by existing customer's existing workload
    growth, and perhaps new applications related to existing ones.

    Some of the original BUNCH architectures do live on in emulation
    (Burroughs, Univac, Honeywell). I believe the other two, CDC and NCR,
    are dead.

    I expect that all of the minicomputer age architectures are dead.

    There also were lots of microcomputer "chip" architectures that are dead (National Semi, ATT, Fairchild, etc.), but I don't necessarily attribute
    that to being overtaken by RISC architectures.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 14:43:50 2025
    From Newsgroup: comp.arch

    On 10/17/2025 1:49 PM, Stephen Fuld wrote:
    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
       x86 / x86-64: Alive and well in PCs.
       6502: Now dead (no more 6502's being made)
      65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
       M68K: Mostly Dead
         NXP ColdFire: Still lives (Simplified M68K).
       MSP430: Still Lives (I classify it as a CISC).
       IBM S/360: Dead on real HW
         Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the S/360
    are alive in real hardware.  While I expect there haven't been any "new name" customers in a long time, the fact that IBM still introduces new
    chips every few years indicates that there is still a market for this architecture, presumably by existing customer's existing workload
    growth, and perhaps new applications related to existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.


    Some of the original BUNCH architectures do live on in emulation (Burroughts, Univac, Honeywell).  I believe the other two, CDC and NCR
    are dead.

    I expect that all of the minicomputer age architectures are dead.

    There also were lots of microcomputer "chip" architectures that are dead (National Semi, ATT, Fairchild, etc.), but I don't necessarily attribute that to being overtaken by RISC architectures.



    OK.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 13:10:25 2025
    From Newsgroup: comp.arch

    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:
    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
       x86 / x86-64: Alive and well in PCs.
       6502: Now dead (no more 6502's being made)
      65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
       M68K: Mostly Dead
         NXP ColdFire: Still lives (Simplified M68K).
       MSP430: Still Lives (I classify it as a CISC).
       IBM S/360: Dead on real HW
         Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware.  While I expect there haven't been
    any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 15:32:39 2025
    From Newsgroup: comp.arch

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common code
    to execute the actual loop bodies.


    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    Most use-cases for longer vectors tend to matrix-like rather than
    vector-like. Or, what cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.


    Maybe.
    Just in my own experience, it seems to fizzle out pretty quickly.

    Typically it is a combination of diminishing returns and costs that reach
    for the sky.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.


    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    In my case, though, to have any real advantage over the existing SIMD, I
    would effectively need a wider memory interface (say, one capable of
    doing 2 loads and 1 store per cycle). If limited to 1 memory access per
    cycle, it would still be effectively limited to ~ 1 element/cycle on
    average (or maybe 2 elements/cycle with Binary16; since I could
    effectively load/store 128 bits at a time, assuming a SIMD-op
    co-executing with one of the memory ops).


    Ironically, this is one of the merits of FP8 and block-encoding for
    weights in NNs: Can effectively batch the memory accesses (by loading
    larger units) so is slightly less hindered by a 1-access-per-cycle
    limitation.

    Though, even if I did have a wider pipe to memory, there would still be the problem of memory bandwidth (to L2 and to external RAM). And, it would
    likely need semi-intelligent streaming or prefetching to get more
    effective use out of what DRAM bandwidth exist (RAM access via the L2
    cache being somewhat slower than the raw bandwidth to the RAM chip).

    Though, one trick was used in the L1 cache that helps:
    If the accessed cache line misses, and the following cache line also
    misses, then handle the misses for both cache lines at the same time (this assumes that the next line address is likely to be accessed in the near future).

    In premise, the L2 cache could use similar logic; though this would
    require more logic, as normally the L2 cache deals with each line access independently (vs the L1 cache, which has to deal with the possibility
    of line-crossing accesses as a normal part of its operation).


    Well, the L1 I$ also always requires both lines to hit (whether or
    not the current fetch crosses a line boundary), vs the L1D$, which only
    needs to stall if the current access misses. A possible optimization
    could be to allow for asynchronous prefetch, but this could lead to more complex scenarios, such as needing to stall for a miss but then wait for
    a preceding in-flight RAM access to finish before the next request could
    be issued. So in this case the L1D$ doesn't allow for asynchronous fetch,
    even if it could be faster. It would otherwise need more complex logic to
    deal with asynchronous memory prefetching in a way that doesn't put
    stability at risk.


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit
    spot.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet
    printers had 5 or 6 different colour inks, to try to fill out more of the
    CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision
    might not be enough, so they allow for double precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    *: The better solution to possible banding issues being not so much to
    use more color depth, but rather to dither. Though, AFAIK a lot of LCD
    panels have built-in dithering, so rather than seeing either true RGB24,
    or an more obviously banded RGB555 or RGB666 approximation, the monitor
    will show a representation with a Bayer dither or similar applied (which
    is mostly not noticeable unless one looks very closely).


    For HDR:
    3x E4.F4 is pretty comparable to RGB555 in terms of quality;
    2x Binary16 is plenty.

    Binary32 or Binary64 seems like serious overkill for HDR image storage.


    Well, and then there is the R11_G11_B10 format:
    R=E5.M6, G=E5.M6, B=E5.M5

    Which is possibly a better option:
    Will match/exceed display quality while still allowing HDR, and more
    compact storage than 3x Binary16.

    Or, RGB9_E5, ...

    One other traditional HDR format is RGB8_E8, but this has its own wonk.


    Though, within existing monitors or computers, little can be done to
    improve over RGB.


    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a
    slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for inkjet printers).

    So, it is like:
    Real life, computers, and inkjet printers, all exist in similar but
    different worlds in terms of color display.

    Well, also LED bulbs, particularly cheap ones or multi-color ones,
    tend to make everything look computer-like (bleh; I actually prefer
    the look of CFLs over this; or halogen bulbs which can at least make a
    proper white light...).

    Had noted when messing around with LEDs, that one generally needs 4 LEDS
    to get something that looks like natural white light.

    IME, I could get this effect with two different schemes:
    R: 675nm
    G: 525nm
    H: 480nm
    B: 440nm
    And, with more readily available LEDs:
    R: 675nm
    G: 525nm
    H: 465nm
    B: 400nm
    Where, either 480nm+440nm or 465nm+400nm can allow for something
    resembling pure white. Can sort of approximate real colors by setting
    the H value to a blend of G and B.

    Comparably, 465nm and 400nm LEDs are easier to find, but proper 440nm
    and 480nm LEDs are a pain to find (where, 480nm is sort of a unique
    color that doesn't really exist on computer displays).

    Can note that 400nm looks different on phone vs real life:
    Phone sees it as a pinkish color;
    Real life, it looks like a very strong blue (similar to 440nm).
    The 465nm LED is a little closer to the sky in real life, but not a good
    match for "blue" on computer displays (usually closer to 440nm).

    But, neither really match cyan or azure on computers, which for me is a different color (more of a separate mixing of green and blue).

    But, partly it is a case of, "meh, it is what it is". Oddly no one else
    really seems to notice the issue, so, ...




    For my uses, for storing HDR within JPEG or UPIC (1) (a custom, vaguely JPEG-like format), I had generally used 3x E4.M4 or similar (which mostly
    works fine, albeit it looks a little funky if one looks at the HDR image as linear RGB).

    Where, UPIC is a format sorta like T.81 JPEG, with some changes:
    Huffman -> STF+AdRice
    DCT -> Block-Haar
    Still uses an 8x8 transform organized into 16x16 blocks.
    Different VLC scheme (Z3.V5)
    Uses RCT vs YCbCr
    Uses a TLV packaging scheme.
    Mostly TWOCC's, with lengths stored inverted.
    The scheme allowed a nice way to allow variable tag/length sizes.
    Placed more limits on the allowed subsampling modes:
    4:2:0, 4:4:4 (RGB)
    4:2:0:4, 4:4:4:4 (RGBA)
    4:0:0, 4:0:0:4 (Monochrome, Monochrome+Alpha)

    Though, there are a few close-calls in terms of optimal choice:
    STF+AdRice vs Huffman with a 13-bit length limit.
    Smaller length limit on Huffman makes it faster.
    Below 12 or 13 severely reduces its effectiveness though.
    However, STF+AdRice needs very little context and has fast setup.
    Also less code vs 13-bit Huffman.
    But, in a strict sense, both speed and compression are worse.
    Block-Haar vs WHT
    Both Block-Haar and WHT are exactly reversible (unlike DCT).
    DCT can be made reversible, but this form is very slow.
    Block-Haar does better with synthetic images, WHT with photos.
    Block-Haar is slightly faster.
    RCT has a close-call with YCoCg.
    RCT was both slightly faster and compressed better in my testing.
    Both are reversible, unlike YCbCr.


    The use of fully reversible transforms does allow also using the format
    in place of PNG. In a PNG-like role, it tends to both often compress
    slightly better, as well as being faster to decode and uses less working
    RAM. Decoding a PNG needs a significant chunk of intermediate memory,
    which can be side-stepped in both UPIC and in an optimized JPEG decoder; though a "generic" JPEG decoder would need more working memory
    (typically needing buffers for decoding the luma and chroma planes for
    the whole image, rather than working "one block at a time").

    Where, for context:
    STF+AdRice:
    Swap-Towards-Front:
    Starts with an initial permutation of all symbols in order;
    Encoding a symbol swaps it towards the front;
    Symbols are encoded as their index in this table;
    Tends to converge towards an optimal ranking.
    AdRice: Adaptive Golomb-Rice Coding
    Can be used in a similar way to Huffman or Adaptive Huffman.
    But, significantly faster than Adaptive Huffman.
    Block Haar: Uses a 2D transform built from 1D transforms, like DCT
    I0..I7 -> O0..O7 basically
    J0=(I0+I1)/2 J1=(I2+I3)/2 J2=(I4+I5)/2 J3=(I6+I7)/2
    J4=I0-I1 J5=I2-I3 J6=I4-I5 J7=I6-I7
    K0=(J0+J1)/2 K1=(J2+J3)/2 K2=J0-J1 K3=J2-J3
    L0=(K0+K1)/2 L1=K0-K1
    O0..O7 = {L0,L1,K2,K3,J4,J5,J6,J7}
    Can use the same ZigZag ordering and Quantization approach as JPEG.
    Albeit with different math for building the quantization matrix.
    Filling the matrix with all 1's allowing for lossless.
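
    A C sketch of the 1-D step listed above (my reconstruction of the
    J/K/L equations; I use floor averages via >>1 so that each
    (average, difference) pair is exactly invertible, e.g.
    a = s + ((d + 1) >> 1), b = a - d):

    static void haar1d_fwd8(const int I[8], int O[8])
    {
        int J[8], K[4], L[2];
        for (int k = 0; k < 4; k++) {
            J[k]     = (I[2*k] + I[2*k + 1]) >> 1;  /* pair averages    */
            J[4 + k] =  I[2*k] - I[2*k + 1];        /* pair differences */
        }
        for (int k = 0; k < 2; k++) {
            K[k]     = (J[2*k] + J[2*k + 1]) >> 1;
            K[2 + k] =  J[2*k] - J[2*k + 1];
        }
        L[0] = (K[0] + K[1]) >> 1;
        L[1] =  K[0] - K[1];
        O[0] = L[0]; O[1] = L[1];
        O[2] = K[2]; O[3] = K[3];
        O[4] = J[4]; O[5] = J[5]; O[6] = J[6]; O[7] = J[7];
    }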

    Different choices might be made if the goal was to have maximum
    compression, but I was biased more towards wanting to keep the decoder
    size modest and reasonably fast.

    Though, it is possible to gain compression (at the cost of speed) by
    running the image bitstream bytes through an LZMA style range coder
    (though, a harder problem is making a range-coder fast).


    The change in VLC scheme:
    3 bits encode the run of zeroes (so, can't skip as many zeroes);
    Uses 5 bits for the coefficient value;
    Coefficient uses a similar encoding scheme to Distance values in Deflate; Signed values are zigzag folded: "V1=(V0<<1)^(V0>>31);"

    IME, this is both less awkward and also compresses slightly better than
    the scheme originally used by JPEG. Though, as with JPEG, an 00 symbol
    can encode an early EOB (all remaining coefficients zero).
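
    The fold quoted above and its inverse, for reference (it maps
    0, -1, 1, -2, 2, ... onto 0, 1, 2, 3, 4, ...):

    #include <stdint.h>

    static inline uint32_t zz_fold(int32_t v)
    {
        return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31);
    }

    static inline int32_t zz_unfold(uint32_t z)
    {
        return (int32_t)(z >> 1) ^ -(int32_t)(z & 1);
    }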


    Can note that, one scheme I had used elsewhere for Huffman coding was:
    Symbols are limited to 13 bits;
    A shorter limit makes the lookup table smaller;
    So, less setup time, and less L1 misses in decoding.
    Huffman tables are encoded as a series of 4 bit lengths;
    0..D: Symbol Length
    E,x: RLE run of preceding length.
    F,x: RLE run of zeroes.
    Both simpler and cheaper to decode than the scheme used by Deflate.
    While typically being similarly compact.
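
    A sketch of decoding that 4-bit length stream (the exact run-length
    biases for the E and F codes aren't given above; the "+3" and "+4"
    below are assumptions purely for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* nibbles[] holds one 4-bit code per byte; returns the number of
       symbol lengths written to lens[]. */
    static size_t decode_lengths(const uint8_t *nibbles, size_t n_nib,
                                 uint8_t *lens, size_t max_syms)
    {
        size_t out = 0;
        uint8_t prev = 0;
        for (size_t i = 0; i < n_nib && out < max_syms; i++) {
            uint8_t c = nibbles[i] & 0xF;
            if (c <= 0xD) {                     /* literal length 0..13 */
                lens[out++] = prev = c;
            } else if (i + 1 < n_nib) {
                uint8_t x = nibbles[++i] & 0xF;
                uint8_t run = (c == 0xE) ? x + 3 : x + 4;  /* ASSUMED biases */
                uint8_t val = (c == 0xE) ? prev : 0;
                while (run-- && out < max_syms)
                    lens[out++] = val;
            }
        }
        return out;
    }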


    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).


    Jumping to an arbitrary address can be useful.
    Using whatever random register as a link register, not as much.

    So, nearly always, it is one of:
    X0: Plain branch;
    X1: Branch-with-link.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:54:23 2025
    From Newsgroup: comp.arch


    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning the
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FF1, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.

    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative approach.

    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:55:51 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive the (otherwise universal) transition to RISC was kept afloat through high revenues and high margins, which allowed the company to spend the much higher sums needed to add all the extra millions of transistors necessary to keep performance competitive.

    Never underestimate the work designers can do when given cubic dollars
    of budget under which to work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:59:10 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..

    Only one selling more than 1M per month.

    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep performance competitive.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 15:59:51 2025
    From Newsgroup: comp.arch

    On 10/17/2025 1:48 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many
    memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    Ironically, this is also partly why I suspect a C-like language could
    benefit from having a "T[]" type that was distinct from "T*", even if
    they were the same representation internally (a bare memory pointer):
    "T[]" could be safely assumed to never alias, except in cases where one
    could have two references to the same array (in which case, they will only
    alias at the same index; and this likely only matters if the inputs and
    outputs are potentially the same array).

    Though, in C, "int arr[]" as an argument is regarded as equivalent to
    "int *arr", so no useful conclusions could be drawn in this case
    (ideally one would need a language where implicit conversion from
    "T*"->"T[]" is an error, and implicit conversion from "T[]"->"T*" is a warning).



    But, alas...

    At least in theory "restrict" works, when people use it.

    Though, "assume TBAA as the universal default" makes some cases faster,
    while screwing over some other use cases; and still doesn't fully
    resolve the type-alias issue as one still can't assume non-alias in
    cases when types are the same.
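
    To make the "restrict" point concrete, here is a minimal sketch (an
    illustration added for this write-up, not code from the post; the
    function name is made up): with "restrict" on the pointer arguments,
    the compiler may assume the output does not overlap the inputs, which
    is exactly the guarantee a vectorizer wants.

        #include <stddef.h>

        /* Sketch: "restrict" promises dst does not overlap a or b, so the
           loop can be vectorized without runtime overlap checks. */
        void vec_add(size_t n, float *restrict dst,
                     const float *restrict a, const float *restrict b)
        {
            for (size_t i = 0; i < n; i++)
                dst[i] = a[i] + b[i];
        }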

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 16:15:33 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:59 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David
    Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this
    *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common
    was the larger register sets.

    Larger register sets were common, but not universal.
    Load/store architecture (with allowance for exceptions for
    synchronization primitives that are not expected to be as fast as
    normal instructions) appears to be universal.


    Yeah.

    Otherwise, RISC-V's 'A' extension (which is a serious violation of
    Load/Store) would be a bigger problem.

    But, I have since realized that (because GCC/etc never really uses these
    instructions for general code generation) one can roll back on native
    hardware support and handle them as traps...


    Granted, the preferable option in this case is to have something like
    "MutexLock()" or "EnterCriticalSection()" as a system call (as, on a
    machine without true atomic operations, and with weak memory coherence,
    a system call that is aware of the actual HW behavior is preferable to
    just trying to fake it in trap handlers).

    Well, and on a single-core system, it reduces to a single choice:
      Caller locks the mutex, return to caller;
      Mutex can't be locked right now:
        flag it and schedule another task
        (and hope the mutex unlocks eventually).
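
    As a purely illustrative sketch of that single-core case (the names,
    the cooperative-scheduling assumption, and the yield primitive are
    hypothetical, not taken from anyone's actual kernel):

        #include <stdbool.h>

        /* Single-core, cooperative case: no atomics needed, because only
           one task runs at a time and task switches only happen at yield
           points (an assumption made for this sketch). */
        typedef struct { volatile bool locked; } mutex_t;

        extern void yield_to_scheduler(void);  /* hypothetical kernel hook */

        void mutex_lock(mutex_t *m)
        {
            while (m->locked)          /* can't take it now...             */
                yield_to_scheduler();  /* ...so run another task and retry */
            m->locked = true;          /* no preemption between test & set */
        }

        void mutex_unlock(mutex_t *m)
        {
            m->locked = false;
        }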


    Well, contrast to using spinlocks in userland, which only really makes
    sense if one assumes:
    There are multiple cores;
    Memory accesses are sequentially consistent between threads.

    And, if the implementation needs to trap on a FENCE or similar, it has
    already lost.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:07:18 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was
    the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have a “large” register set?

    (How “large” is “large”? The VAX had 16 registers; was there any RISC architecture with only that few?)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:20:49 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it
    would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively-
    parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)

    Short-vector SIMD was introduced along an entirely separate evolutionary
    path, namely that of bringing DSP-style operations into general-purpose
    CPUs.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the
    two were bound to have underlying features in common.

    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo”
    inkjet printers had 5 or 6 different colour inks, to try to fill out
    more of the CIE space. Computer monitors could do the same. Look at the
    OpenEXR image format that these CG folks like to use: that allows for
    more than 3 colour components, and each component can be a float --
    even single-precision might not be enough, so they allow for double
    precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.

    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    Sure, you may think 64-bit floats must be overkill for this purpose; but
    these are artists you’re dealing with. ;)

    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for inkjet printers).

    That is always true; “white” is never truly “white”, which is why those
    who work in colour always talk about a “white point” for defining what is meant by “white”, which is the colour of a perfect “black body” emitter at
    a specific temperature (typically 5500K or above).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:21:31 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:55:51 GMT, MitchAlsup wrote:

    Never underestimate the work designers can do when given cubic dollars
    of budget under which to work.

    “Cubic dollars” ... I like that. ;)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:24:23 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 14:43:50 -0500, BGB wrote:

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I’ve been told that’s wrong about zArchitecture, but it is true for iArchitecture (current incarnation of AS/400 -- yes, apparently that’s
    still around in a small way, too).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 22:52:55 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/17/2025 1:48 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    Ironically, this is also partly why I suspect a C-like language having a "T[]" type that is distinct from "T*" could be useful, even if
    they were the same representation internally (a bare memory pointer):
    "T[]" could be safely assumed to never alias

    Restrict does nothing to make the following (quoted from a few days
    ago) run fast and get the right answer:
    {
        I might note My 66000 vectorizes loops not instructions to avoid
        this problem; For example::

            for( i = 0; i < max; i++ )
            {
                temp = a[i];
                a[i] = a[max-i];
                a[max-i] = temp;
            }
    }

    On My 66000, when i !~= (max-i) the loop runs at vector speeds;
    when i ~= (max-i) it runs slow to get the right answer
    (where ~= means "within a cache line").

    At least in theory "restrict" works, when people use it

    ... under the specification restrict has.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Fri Oct 17 19:52:05 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware. While I expect there haven't been
    any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?

    The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The stages of the
    z15 core (2019) don't look anything like Power10's (2021).

    https://www.servethehome.com/wp-content/uploads/2020/08/Hot-Chips-32-IBM-Z15-Processor-Pipeline.jpg

    https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-microarchitecture-block-diagram/
    https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-microarchitecture-core-flexibility/




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 18 00:37:43 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    See Univac 1108

    (How “large” is “large”? The VAX had 16 registers; was there any RISC
    architecture with only that few?)

    Clipper.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 18 00:42:27 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively- parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)

    Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.

    CDC 7600 had rather explicit pipelining--a lot more ordered than
    CDC 6600.

    So, in this case, a truer analog of Cray style vectors would not be variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.

    Which is a GOOD thing !!

    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    That is why the arithmetic is done in 16-bits.

    Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)

    Many can see gamut colors you cannot discern--and they care about it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 01:05:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations into
    general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    Actually, there was something Intel tried to do before that, called “NSP”, for “Native Signal Processing”, which was supposed to be the DSP-killer. Microsoft nixed that idea, for some reason.

    MMX came later, and as you may recall, it was a bit of a fudge (sharing registers with the floating-point unit), and not a very successful one at that. Intel couldn’t even decide what “MMX” meant: first it was supposed to be “Multi-Media eXtensions”, then that was changed to “means nothing at
    all”? Why? So it could be trademarked, of course!

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units
    and superscalar execution.

    Which is a GOOD thing !!

    Which? Lockstep SIMD, or more asynchronous multiple function units?

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.

    Secondly, we’re talking about input image formats. Remember that every
    image-processing step is going to introduce some generational loss due
    to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    That is why the arithmetic is done in 16-bits.

    Heck no. We’re talking up to 64-bit floats now.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Oct 17 22:22:44 2025
    From Newsgroup: comp.arch

    On 2025-10-17 3:03 a.m., Lawrence D’Oliveiro wrote:

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).

    Something I like about the PowerPC: the link register does not detract
    from the GPRs. A second link register would be handy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Oct 17 22:29:28 2025
    From Newsgroup: comp.arch

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.
    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 22:16:52 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:20 PM, Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.


    It was mostly needed, for example, when switching between Single and
    Double precision, and sucked...

    Though, can note that encoding for FPU ops looked like:
    1111-nnnn-mmmm-ZZZZ
    So, 16 possible 2R instructions for the FPU.

    There was in effect, not enough encoding space to do the FPU well with
    only 4 bits. So, the FPU encoding was modal.

    Still did pretty well.

    Now, seemingly RISC-V couldn't even manage with 25 bits, so effectively
    burns 7x 25-bit blocks (or nearly 28 bits of entropy) on the FPU.
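
    Purely as an illustration of that 1111-nnnn-mmmm-ZZZZ layout (a
    sketch, not actual SuperH decoder code):

        #include <stdint.h>

        /* Field extraction for a 16-bit opcode of the form
           1111-nnnn-mmmm-ZZZZ; which of the 16 ZZZZ operations is meant
           then further depends on the FPU mode bits, as described above. */
        static void decode_fpu_op(uint16_t insn, int *rn, int *rm, int *zzzz)
        {
            *rn   = (insn >> 8) & 0xF;   /* nnnn: one register field */
            *rm   = (insn >> 4) & 0xF;   /* mmmm: the other register */
            *zzzz =  insn       & 0xF;   /* ZZZZ: 16 possible 2R ops */
        }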


    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.


    Yeah, but this works assuming that your vector ops are primarily mapped
    to long-running loops.

    In a lot of cases, you don't have this, and a large vector won't be usable.

    Consider, you want to write a function to fragment larger primitives
    into smaller primitives to minimize affine warping (where the number of
    input and output primitives will differ and you don't know in advance
    which primitives will fragment, etc). Likely, Cray-style vectors won't
    really help you there (but short-vector SIMD will help).


    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively- parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)


    I think they were also mostly intended for CFD and FEM simulations and similar, or stuff that is very regular (running the same math over a
    whole lot of elements).


    Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose
    CPUs.


    Could be.

    Hadn't really looked that much into where SIMD came from originally.

    Some stuff I had read implied that vector processing came first, but
    then due to the limits of vector processing, supercomputers went over to
    SIMD; and then Intel added MMX presumably as an imitation of these supercomputers, and it went from there.


    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.


    OK.


    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.


    Possibly.

    As can be noted, it makes sense to allow some amount of superscalar over
    the SIMD operations, but this gets limited by whatever is the most
    limited resource.

    In my project, this limit is mostly memory access.


    I did some more benchmarks, and also noted that in my old laptop, it is
    also mostly bound by memory access:
    It can't do vector multiply-accumulate faster than it can read the
    floating point data from memory and write back the results;
    And, the smallest floating-point format it has is Binary32.



    It is likely that to push either vector processing or SIMD to its full performance, one would need a massive amount of memory bandwidth.

    Or, say, on a desktop PC, getting one 128-bit SIMD vector per clock
    at 3.7GHz would need roughly 90GB/sec of memory bandwidth, which is
    not likely to happen anytime soon...

    One can think the PC's CPU is flying when it does memcpy at 3.6 GB/sec
    or so, nowhere near enough.


    But, one thing that does help with relative performance in the face of a bandwidth limit (say, for NNs) is vectors with 8-bit elements and ~ 4-bit/element weights, and the ability to pipeline a lot of secondary
    ops (such as vector conversions) in parallel with other instructions.

    So, for example, even if you can't do a memory load or store at the same
    time as a SIMD op, you can still do SIMD vector conversions in parallel
    with SIMD ops or with memory accesses.



    Maybe it’s time to look beyond RGB colours. I remember some “Photo” >>> inkjet printers had 5 or 6 different colour inks, to try to fill out
    more of the CIE space. Computer monitors could do the same. Look at the
    OpenEXR image format that these CG folks like to use: that allows for
    more than 3 colour components, and each component can be a float --
    even single-precision might not be enough, so they allow for double
    precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.


    Possibly.

    My monitor has HDR, sorta, but I ended up not using it, as its effects
    were mostly, it seems:
      Makes the image brighter in general
        (like it turns up the effective brightness setting);
      Adds ringing artifacts around edges;
      Causes the screen image to flicker every few minutes or so.
        Very annoying, like the screen will just go black for a few seconds,
        often once every few minutes.

    Say, for example, with HDR turned on, a sudden sharp transition between
    Red and Green or similar will result in an ugly black line and ringing artifacts.

    Otherwise, if I wanted my monitor brighter, I could turn up the
    brightness level some more (I have it at a level that it doesn't burn my eyes).

    Kinda doesn't look great so not really worth it (vs leaving monitor in
    LDR mode).

    Seems more like a kind of gimmick.



    The more useful form of HDR IME is to use floating-point rendering and
    then render this out to LDR based on whatever is the current "exposure
    level" in the 3D rendering or similar.
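
    A hedged sketch of that render-in-float, map-to-LDR step (the
    Reinhard-style curve and the gamma value are common choices picked for
    illustration, not necessarily what BGB's renderer does):

        #include <math.h>
        #include <stdint.h>

        /* Map a linear HDR channel value to an 8-bit LDR channel, given
           the scene's current exposure level. */
        static uint8_t tonemap_channel(float hdr, float exposure)
        {
            float v = hdr * exposure;    /* apply current exposure level */
            v = v / (1.0f + v);          /* compress [0,inf) into [0,1)  */
            v = powf(v, 1.0f / 2.2f);    /* rough display gamma          */
            if (v < 0.0f) v = 0.0f;
            if (v > 1.0f) v = 1.0f;
            return (uint8_t)(v * 255.0f + 0.5f);
        }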



    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.


    Possibly, but here we still don't usually need much more than RGB24 or similar.

    Likewise, FP8U (E4.M4) is maybe pushing it a little on the low-end for
    HDR, but basically works.
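
    For readers unfamiliar with 8-bit minifloats, a sketch of how an
    unsigned E4.M4 value might decode (the bias of 7, the implicit leading
    1, and the subnormal handling are assumptions made for illustration,
    not necessarily BGB's actual FP8U format):

        #include <math.h>
        #include <stdint.h>

        /* Hypothetical decode of an unsigned 8-bit float: 4 exponent bits,
           4 mantissa bits, no sign bit. */
        static float fp8u_e4m4_to_float(uint8_t v)
        {
            int e = (v >> 4) & 0xF;                    /* exponent field   */
            int m =  v       & 0xF;                    /* mantissa field   */
            if (e == 0)                                /* assumed subnormal */
                return ldexpf((float)m / 16.0f, 1 - 7);
            return ldexpf(1.0f + (float)m / 16.0f, e - 7);
        }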


    Meanwhile, a lot of late 1990s or early 2000s GPUs were like, "you are
    gonna take RGB555 and you are gonna like it".

    Like, say, I suspect the "Mobility Radeon 9000" in my older laptop is
    probably internally using RGB555 or RGBA4444 for textures (also probably
    with a 12|16 bit Z-Buffer), and reduced precision transform, ...

    Though, it predates having support for shaders. Also a weird quirk that
    if you encode DXT5 but then use the transparent-endpoint ordering from
    DXT1, the block seems to decode as it would in DXT1 (so, for DXT5, it
    needs to always use the opaque ordering to decode correctly).




    Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)


    Overkill is overkill.


    They can just be happy that in these modern times we are (mostly) free
    of indexed color and 16-color.

    Actually kinda hard to do non-terrible graphics in 16 color. Also, one
    may have to give up one of the colors for transparency. Typically, I had
    used hi-magenta as transparent color.

    But, sometimes, 16 colors is all you need.

    Like, behold, a video of a game from 2024 (Crimson Diamond):
    https://www.youtube.com/watch?v=3kOrATKd_Mc
    Where the whole game uses 16-color graphics...



    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a
    slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for
    inkjet printers).

    That is always true; “white” is never truly “white”, which is why those
    who work in colour always talk about a “white point” for defining what is meant by “white”, which is the colour of a perfect “black body” emitter at
    a specific temperature (typically 5500K or above).

    Yeah.

    Indoor lights typically come in "warm white" and "cool white". I usually prefer "cool white", but "warm white" is more common.

    Some people go on about how good LED looks vs CFL, but I actually
    slightly prefer the look of CFL. Both types of lighting screw up the
    colors, so it is a choice of which is "better", and in my case, I more
    lean towards CFL and fluorescent.

    Incandescent beat both of them, as did halogen (when the UV filter is in place, *). But, alas, pretty much no one uses halogen for indoor
    lighting, so at this point it is just sort of a choice between LED and fluorescent because people have gone and taken the incandescent bulbs away.

    *: Halogen looks good with a uv filter, and kinda terrible without a UV filter.


    Pretty much no one else seems to notice though, alas...

    Well, at this rate, maybe people will start trying to light their houses
    with magenta grow lights, "looks white enough to me...".

    Same basic issue, annoying when one seems to be the only person around
    that sees something.

    Or, basically, this issue: https://en.wikipedia.org/wiki/File:Led_grown_lights_useful.jpg

    Except nearly all the new LED bulbs are kinda like this (albeit not
    quite as extreme) and I am displeased. Like, man, I am not a plant, I
    don't need to live under grow lights.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 22:56:59 2025
    From Newsgroup: comp.arch

    On 10/17/2025 9:29 PM, Robert Finch wrote:
    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.
    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.
    Bits beyond 8 would be for some sea creatures or viewable with special glasses?


    I don't think I can see much beyond 7 or 8 bits.

    I can see the banding artifacts from RGB555.
    I can slightly see the difference between LCD and a VGA CRT.
    On an LCD, the dithering causes a slightly "gritty" look with gradients
    that is absent with a CRT. Mostly, the banding or grit isn't enough to
    be worth caring about.

    But, RGB555 is still a big step up from indexed color, and "almost
    mostly good enough", except when one needs an alpha channel or HDR (then
    it kinda falls on its face).

    Though, I had often still used a format that is either RGB555 or
    RGB444_A3, as often good enough (has 5-bits/channel when opaque, or 4
    bits when translucent, per pixel).
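
    For concreteness, a generic RGB555 pack/unpack sketch (the common
    0RRRRRGGGGGBBBBB layout is assumed; the RGB444_A3 variant's exact bit
    assignment isn't given in the post, so it is not shown):

        #include <stdint.h>

        /* Pack 8-bit channels down to 5 bits each, and rescale 5-bit
           channels back up to 0..255 on unpack. */
        static uint16_t rgb24_to_rgb555(uint8_t r, uint8_t g, uint8_t b)
        {
            return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
        }

        static void rgb555_to_rgb24(uint16_t p,
                                    uint8_t *r, uint8_t *g, uint8_t *b)
        {
            *r = (uint8_t)((((p >> 10) & 0x1F) * 255 + 15) / 31);
            *g = (uint8_t)((((p >>  5) & 0x1F) * 255 + 15) / 31);
            *b = (uint8_t)((( p        & 0x1F) * 255 + 15) / 31);
        }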



    Bigger annoyance to me is a "tint" that permeates pretty much all of the
    artificial displays, and that the newer LED bulbs have also adopted.

    Like, sort of a color that is like blue + yellow rather than a true
    white. There is no way to get rid of this tint, as it is as if the tint
    were somehow itself a part of the RGB colorspace.

    We didn't really have this issue with CFLs.

    But, with monitors, I am at least mostly used to it; would prefer not to
    have the real-world tinted as well though.


    Had noted, inkjet printers don't have this particular issue though...
    They instead have the issue that nearly all the colors they print are
    biased towards looking crap-brown. Like, even if you use it to print a
    solid magenta, it still somehow has a crap-brown tinge to it (and only
    the plain white of non-printed parts of the page avoid this).

    Though, casually looking at it, it may not be obvious in isolation. It
    is more obvious if looking through a phone though: If the colors on a
    printed page change drastically between the real-life page, and the
    image seen through a phone screen, it is usually inkjet (and, if not, it
    is color laser).

    I suspect it is partly because color laser (along with many plastic
    products) has the same sort of green+blue cyan color often seen on
    monitors and similar.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 23:22:37 2025
    From Newsgroup: comp.arch

    On 10/17/2025 4:52 PM, EricP wrote:
    Stephen Fuld wrote:
    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware.  While I expect there haven't been any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before.  I wonder where it came
    from?

    The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of. Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary
    CPUs with it.

    BTW, with the AS/400, Power didn't emulate the older S/38 CPU. AS/400
    is unusual in having lots of its functionality done in software, so IBM
    "just" ported that software to Power. For the other stuff, there was a
    sort of emulation layer, but the first time a program was run, it got
    silently recompiled to target the new architecture. Or something like
    that.


    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The stages of the
    z15 core (2019) doesn't look anything like Power10 (2021).

    Right.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 23:44:01 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:37 PM, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    See Univac 1108

    I am not sure what you are saying here. While the 1108 did have some
    characteristics of RISC, such as fixed-length instructions, it had some
    decidedly non-RISCy features such as mem+op instructions, optional
    indirect memory addressing, and some instructions that could search
    multiple memory locations. Its register architecture was a little odd,
    but it wasn't small. There were essentially about 40 user registers,
    though some (16) were arithmetic-only and some (15) memory-address-only
    (sort of like the Motorola 68K), but four of those actually "overlapped"
    the arithmetic registers so could be used for either, and some (15)
    could only store data.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:44:06 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 22:16:52 -0500, BGB wrote:

    On 10/17/2025 5:20 PM, Lawrence D’Oliveiro wrote:

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of
    operand tuples in exactly the same way.

    Yeah, but this works assuming that your vector ops are primarily mapped
    to long-running loops.

    Maybe not. I recall in the Cray docs somewhere, that the break-even point
    for vector operations was as small as a vector size of 2. That is, if you
    had just two operand tuples, it was worth it to go through the vector- operation setup, instead of doing two sets of scalar operations.

    So RISC-V probably takes a bit more setup with the additional
    specification of operand types. But I suspect that will not move the break-even point up by, say, dozens of elements; probably only needs a few more elements to make it worthwhile.

    Maybe that was just a software thing: the Cray machines had their own
    architecture(s), which was never carried forward to the new massively-
    parallel supers, or RISC machines etc. Maybe the parallelism was
    thought to render deep pipelines obsolete -- at least in the early
    years. (*Cough* Pentium 4 *Cough*)

    I think they were also mostly intended for CFD and FEM simulations and similar, or stuff that is very regular (running the same math over a
    whole lot of elements).

    Also code breaking by Government spooks. There is a story of some guy in a presentation by Cray, who stood up at the back and stressed the importance
    of having population-count instructions, while refusing to go into detail about what he would use them for or even who he was.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 06:46:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be the changes relating to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which make sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    (According to the Fortran standard, the compiler has to honor
    parentheses).
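
    For comparison, a hedged note on the C side (not from the post above):
    in C, parentheses alone do not stop a compiler from contracting a
    multiply and an add into an FMA; the standard knob is C99's
    FP_CONTRACT pragma. A minimal sketch, with a made-up function that
    just mirrors one of the LAPACK lines:

        /* Ask the compiler not to contract a*b + c into an FMA, so each
           multiply and add is rounded separately. The default state of
           this pragma is implementation-defined. */
        #pragma STDC FP_CONTRACT OFF

        double rotate_term(double bb, double cs, double dd, double sn)
        {
            return (bb * cs) + (dd * sn);
        }
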
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:46:44 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.

    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of
    that in a movie, and you can see why the video images will need more than 8-by-8-by-8 dynamic range.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:54:26 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 23:22:37 -0700, Stephen Fuld wrote:

    BTW, with the AS/400, power didn't emulate the older S/38 CPU. AS/400
    is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power.

    There was some custom microcode added to the POWER chips specifically for
    the iSeries machines. I remember seeing a YouTube video where the
    presenter tried to make sense of some disassembled machine code -- it was mostly recognizable as POWER instructions, but the extras were not
    documented publicly anywhere.

    Might have been one of the videos on this channel <https://www.youtube.com/@MatthewMainframes/videos>, but I’m not sure.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 06:58:19 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> schrieb:
    On 17/10/2025 08:48, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier
    <https://en.cppreference.com/w/c/language/restrict.html>?

    "restrict" can significantly improve non-vectored code too, as well as
    more "ad-hoc" vectoring of code where the compiler uses general-purpose registers, but interlaces loads, stores and operations to improve pipelining. But it is certainly a very useful qualifier for vector code.

    You can apply it to arguments, but then you cannot use other
    pointers as "shorthand", so

    void foo(int *restrict a)
    {
        int *restrict b = a;
        // Do something with b
    }

    is undefined.
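
    A hedged aside, based on the example in the C standard rather than on
    the post: a restrict-to-restrict copy is only defined when the copy
    lives in a block nested inside the original pointer's block, so an
    inner block (or an unqualified alias) is the usual workaround.

        /* Modeled loosely on the C11 6.7.3.1 example: restrict-to-restrict
           copies are only defined "outer block to inner block". */
        void bar(int n, int * restrict a)
        {
            {
                int * restrict b = a;  /* OK: b's block is nested in a's */
                for (int i = 0; i < n; i++)
                    b[i] = 0;
            }
            /* int * restrict c = a;  same block as a: undefined, as above */
        }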

    Fortran has it simpler: Arguments cannot alias each other, or
    things from COMMON blocks, or ... unless explicitly declared TARGET.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:05:41 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    1/x = RSQRT(x)*RSQRT(x), also just one FMUL

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.
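
    To make the building-block idea concrete, a small sketch (the seed is
    assumed to come from a hardware estimate or a lookup table, which is
    left abstract here; the identities are the ones listed above):

        /* One Newton-Raphson step refines an approximate rsqrt seed y0
           for x:  y1 = y0 * (1.5 - 0.5*x*y0*y0).
           sqrt(x) and 1/x then follow with multiplies only. */
        static double rsqrt_step(double x, double y0)
        {
            return y0 * (1.5 - 0.5 * x * y0 * y0);
        }

        static double sqrt_via_rsqrt(double x, double r) /* SQRT(x) = RSQRT(x)*x */
        {
            return r * x;
        }

        static double recip_via_rsqrt(double r)          /* 1/x = RSQRT(x)^2 */
        {
            return r * r;
        }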

    Your last example is where I got involved with the issue: a computational
    fluid chemistry researcher from Sweden reached out; he wanted to speed
    up Sqrt(), which he believed to be the bottleneck when calculating the reciprocal distance for all his chemical force estimates.

    After looking at his source code, it was obvious that by directly
    calculating 1/sqrt(sum of squares), the speedup would be much more significant.

    In the end I created a function which calculated three RSqrt() values in
    parallel; this was by far the most common use case for any reaction
    taking place in an H2O solution, and it allowed almost all the latency
    delays to be overlapped between the three copies of the pipeline.

    In the end, his week-long simulations (running on Alpha and PentiumPro
    cpus) ran in exactly half the time so now he could double the number of
    runs.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:21:32 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:
    Short-vector SIMD was introduced along an entirely separate evolutionary
    path, namely that of bringing DSP-style operations into general-purpose
    CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of typically 8-
    and 16-bit elements; it was the enabler for SW DVD decoding. ZoranDVD
    was the first to properly handle 30 frames/second with zero skips, and it
    needed a PentiumMMX-200 to do so.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:25:16 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/17/2025 4:52 PM, EricP wrote:
    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of.  Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary CPUs with it.

    BTW, with the AS/400, power didn't emulate the older S/38 CPU.  AS/400
    is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power.  For the other stuff, while there
    was a sort of emulation layer, but the first time a program was run, it
    got silently recompiled to target the new architecture.  Or something
    like that.
    I consider AS/400 to be the blueprint for Mill's choice to have a model-portable distribution format that goes through the specializer in
    order to be compatible with the actual CPU model it is now running on.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:33:13 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits
    beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.
    10 million is more than what I've heard/seen, but OK:
    More interesting is the fact that females tend to have about 10x the
    ability to distinguish colors compared to men, due to the fact that the blue-green receptors are tied to the X chromosome, and they don't have
    to be exactly the same. I know this is true for my wife and me, but on
    the other hand I have much better monochrome vision so I can see better
    when it is quite dark.

    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of that in a movie, and you can see why the video images will need more than 8-by-8-by-8 dynamic range.
    In reality they don't even (really) try. :-)
    Many years ago, they even had to shoot all night-time scenes during the
    day because the film and cameras didn't have nearly enough dynamic range.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Oct 18 08:27:14 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    Where is there an architecture you would class as "RISC", but did not have
    a "large" register set?

    (How "large" is "large"? The VAX had 16 registers; was there any RISC architecture with only that few?)

    The first IBM 801 has 16 registers. ARM A32/T32 has 16 registers (and
    shares the VAX's mistake of making the PC accessible as GPR). RV32E
    (and, I think, RV64E) has 16 registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sat Oct 18 04:46:43 2025
    From Newsgroup: comp.arch

    On 10/18/2025 3:33 AM, Terje Mathisen wrote:
    Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits
    beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.

    10 million is more than what I've heard/seen, but OK:

    More interesting is the fact that females tend to have about 10x the
    ability to distinguish colors compared to men, due to the fact that the blue-green receptors are tied to the X chromosome, and they don't have
    to be exactly the same. I know this is true for my wife and me, but on
    the other hand I have much better monochrome vision so I can see better
    when it is quite dark.


    I seem to have a quirk that I see best in dim conditions; but am nearly blinded in direct sunlight (but, can see better with shade-4 or shade-5 glasses).

    For me, shade-5 seems to work best for daytime conditions:
    Shade 4 isn't quite dark enough;
    Shade 7 is too dark (difficult to see effectively with shade 7).

    Hard to find shade-5 glasses that aren't strongly tinted (usually
    green), found some that (merely) have a yellow tint, still better than monochromatic green.

    Though, despite some drawbacks (like being mostly monochromatic green),
    some shade-5 welding goggles are otherwise pretty effective at defeating
    the sun (and not letting light in from the side). Some dark sunglasses
    still have more of an issue with light leakage (where if light leaks in
    from the side and bounces off the inside of the lens, this isn't ideal
    for visibility). But, then it is also hard finding shade-5 glasses that
    aren't green, so, ...

    Most of the normal sunglasses aren't really dark enough (they need to be
    dark enough to be effective).




    Had noted, the outdoor conditions where I see best (which ironically
    looks the most like typical images of daytime conditions) is after the
    sun has set, but before it has gotten dark.

    So, say: Real daytime: nearly everything that the sun hits is covered in
    a white haze.

    Full night-time conditions are still dark though.
    So, alas, still no ability to see particularly well in night-time
    conditions either.



    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of
    that in a movie, and you can see why the video images will need more than
    8-by-8-by-8 dynamic range.

    In reality they don't even (really) try. :-)


    Yes.


    Many years ago, they even had to shoot all night-time scenes during the
    day because the film and cameras didn't have nearly enough dynamic range.


    In the conditions I see best, things like my cellphone camera have a
    hard time taking good pictures (the images are dark and often have
    significant noise).

    Like, my room is at an OK light level for me, but phone sees it all as
    dark and grainy. Room is lit by a CFL bulb in an overhead holder
    (current bulb is 50W equivalent IIRC).



    Seemingly one has to put something under a bright lamp (uncomfortably
    bright) before the phone camera can get an image that isn't dark and noisy.

    I can still see OK in situations where phone cameras just mostly give an
    all black image.

    However, the phone camera can see things better in brightly lit
    conditions than I can.


    Though, sometimes extra light can help: for example, although a little
    unpleasantly bright, using a lamp for things like soldering can be
    helpful (say, a lamp with a 40W-equivalent CFL bulb).


    I once had a sort of mini desk lamp lit by a smaller bulb; I don't
    have it now. But a lot of similar bulbs exist on Amazon.

    I don't see many with the same type of design, but the 2.5W E10 bulbs
    appear to be a similar category, and are fairly readily available.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Sat Oct 18 07:31:54 2025
    From Newsgroup: comp.arch

    On 10/18/2025 1:25 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 10/17/2025 4:52 PM, EricP wrote:
    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of.  Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary
    CPUs with it.

    BTW, with the AS/400, Power didn't emulate the older S/38 CPU.  AS/400
    is unusual in having lots of its functionality done in software, so
    IBM "just" ported that software to Power.  For the other stuff, there
    was a sort of emulation layer, but the first time a program was run,
    it got silently recompiled to target the new architecture.  Or
    something like that.

    I consider AS/400 to be the blueprint for Mill's choice to have a
    model-portable distribution format that goes through the specializer
    in order to be compatible with the actual CPU model it is now running on.

    Absolutely. I think Ivan has even said so.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sat Oct 18 17:16:00 2025
    From Newsgroup: comp.arch

    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a
    much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic. So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range. Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.
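
    As a concrete illustration of one widely used non-linear encoding, the
    standard sRGB transfer curve spends more of the 8-bit code values on
    dark tones than a linear scale would (standard constants; just a
    generic sketch, not anything from this thread):

      /* Encode a linear-light value in [0,1] to an 8-bit sRGB code. */
      #include <math.h>
      #include <stdint.h>

      static uint8_t srgb_encode8(double linear)
      {
          double v;
          if (linear <= 0.0031308)
              v = 12.92 * linear;                          /* linear toe */
          else
              v = 1.055 * pow(linear, 1.0 / 2.4) - 0.055;  /* gamma segment */
          if (v < 0.0) v = 0.0;
          if (v > 1.0) v = 1.0;
          return (uint8_t)(v * 255.0 + 0.5);
      }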


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sat Oct 18 13:16:17 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:54:23 GMT, MitchAlsup
    <[email protected]d> wrote:


    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how
    to make the (17 kinds of) hammers one needs, there is little need to
    make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning
    the whole kit and caboodle for tablets and phones which did not work
    BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by
    demand for the product, but pushed by companies to make already
    satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there is now enough
    well trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have
    conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FFT, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.


    I agree completely! However, numeric libraries are not what the
    average developer is looking for. For every 1 looking for a numerics
    library, there are 100,000 looking for some kind of web function,
    editing, data interchange, or database library.


    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    Yeah, that happens too.


    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and
    unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative approach.

    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.

    And that also. Clearly if the economics of the <whatsit> changes, you
    have to re-evaluate using it.

    Company I worked for had a handful of Sparc 5s running Solaris. We
    only used them in connection with a board-level debugger which we needed
    for developing some embedded projects running VxWorks on 68K VME. The
    Sparcs monitored the VME module and allowed replaying system level
    events to figure out what led to <whatever was going on>.

    I overheard the manager complaining that we could buy 3-4 top-of-the-line
    Pentium workstations for the cost of each Sparc. Unfortunately - at
    that time - the debugger/monitor software didn't run on x86. A few
    years later there was an x86 version introduced, but, by that time, we
    weren't doing anything that needed it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 20:25:08 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 08:27:14 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    Where is there an architecture you would class as "RISC", but did
    not have a "large" register set?

    (How "large" is "large"? The VAX had 16 registers; was there any
    RISC architecture with only that few?)

    The first IBM 801 has 16 registers. ARM A32/T32 has 16 registers (and
    shares the VAX's mistake of making the PC accessible as GPR). RV32E
    (and, I think, RV64E) has 16 registers.

    - anton

    I wouldn't count the 801, because it was a concept rather than a
    production CPU. But ROMP does count. Not a success, but a product
    nevertheless. Another (apart from ARM) successful RISC with a small
    register file is Hitachi SH.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 21:42:17 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:

    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way
    rather than redesigning whole kit and caboodle for tablets and
    phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there is now enough well trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPAC has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that LAPACK API was not updated in decades, although I'd
    expect that even at API level there were at least small additions, if
    not changes. But if you are right that LAPACK implementation was not
    updated in decades, then you could be sure that it is either not used by
    anybody or used by very few people.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS. Both libraries
    are not just updated, but more like permanently re-written.
    I'm pretty sure that the same applies to Apple's implementations
    of BLAS and LAPACK.
    And, of course, the same applies to the GPGPU implementations, both
    from NV and from AMD and more recently from Intel as well.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FF1, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project
    somewhere else that still is actively under development. Even if
    it's buggy and unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative
    approach.
    YMMV but, as a software developer myself, this attitude makes me
    sick. 8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 19:24:21 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to choose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions, if
    not changes. But if you are right that LAPACK implementation was not
    updated in decades, then you could be sure that it is either not used
    by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R. I learned about R the hard way, when a wrong interface
    in the C bindings of Lapack surfaced after a long, long time.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    I agree. There is a _lot_ of active research in numerical
    algorithms, be it for ODE systems, sparse linear solvers or whatnot.
    A lot of that is happening in Julia, actually.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Sat Oct 18 19:36:17 2025
    From Newsgroup: comp.arch

    According to EricP <[email protected]>:
    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?

    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    The S/38 and AS/400 had a virtual instruction set called TIMI which
    is translated into native code the first time a program is run.
    They didn't write an emulation layer. They just wrote a new
    translator to POWER rather than to the previous low level architecture.

    I gather most phones do the same thing, translating JVM or ART code to
    native code when it installs an app.

    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The pipeline stages of the
    z15 core (2019) don't look anything like those of Power10 (2021).

    https://www.servethehome.com/wp-content/uploads/2020/08/Hot-Chips-32-IBM-Z15-Processor-Pipeline.jpg

    I would expect them to be different since z has to run S/360 code which is rather different from
    POWER code.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 23:11:38 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to chose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions,
    if not changes. But if you are right that LAPACK implementation was
    not updated in decades, then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Are Python (numpy and scipy, I suppose) or R linked against an
    implementation of LAPACK from 30 or 40 years ago, as suggested by
    Mitch? Somehow, I don't believe it.
    I don't use either of the two for numerics (I use Python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses
    relatively new implementations, and I'm pretty sure that the same goes
    for Matlab.


    I learned about R the hard way, when a wrong
    interface in the C bindings of Lapack surfaced after a long, long
    time.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.
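
    For illustration, the kind of "simple algorithm" meant here -- a
    textbook unblocked Cholesky factorization, easy to write directly for
    small or medium n (a generic sketch, not the actual code referred to
    above):

      /* Factor a symmetric positive-definite n*n matrix (row-major) as
         A = L*L^T, overwriting the lower triangle of 'a' with L.
         Returns 0 on success, -1 if A is not positive definite. */
      #include <math.h>

      int cholesky(double *a, int n)
      {
          for (int j = 0; j < n; j++) {
              double d = a[j * n + j];
              for (int k = 0; k < j; k++)
                  d -= a[j * n + k] * a[j * n + k];
              if (d <= 0.0)
                  return -1;
              a[j * n + j] = sqrt(d);
              for (int i = j + 1; i < n; i++) {
                  double s = a[i * n + j];
                  for (int k = 0; k < j; k++)
                      s -= a[i * n + k] * a[j * n + k];
                  a[i * n + j] = s / a[j * n + j];
              }
          }
          return 0;
      }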


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    I agree. There is a _lot_ of active research in numerical
    algorithms, be it for ODE systems, sparse linear solvers or whatnot.
    A lot of that is happening in Julia, actually.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Sat Oct 18 22:10:35 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set
    Computer” -- because the one feature that really did become common was
    the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    (How “large” is “large”? The VAX had 16 registers; was there any RISC
    architecture with only that few?)

    Cortex M0 has only 8 general purpose registers. There are 8 other
    ARM registers, but on Cortex M0 they can be used only by selected
    instructions.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 22:20:14 2025
    From Newsgroup: comp.arch

    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend
    “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Sat Oct 18 22:22:32 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> wrote:
    On 16/10/2025 23:26, BGB wrote:
    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single
    primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling
    up the instruction opcode space with a combinatorial explosion. (Or
    sequence of combinatorial explosions, when you look at the wave after
    wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on
    different hardware.  With SIMD, you need different code if your
    processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
    different instructions using different SIMD registers.  With the
    vector style instructions in RISC-V, the actual SIMD registers and
    implementation are not exposed to the ISA and you have the same code
    no matter how wide the actual execution units are.  I have no
    experience with this (or much experience with SIMD), but that seems
    like a big win to my mind.  It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar cpus,
    rather than Itanium EPIC coding.


    But, there is a problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
      Vector Masking;
      Resource and energy costs of using wider vectors;
      ...


    I appreciate that. Often you will either be wanting the operations to
    be done on a small number of elements, or you will want to do it for a
    large block of N elements which may be determined at run-time. There
    are some algorithms, such as in cryptography, where you have sizeable
    but fixed-size blocks.

    When you are dealing with small, fixed-size vectors, x86-style SIMD can
    be fine - you can treat your four-element vectors as single objects to
    be loaded, passed around, and operated on. But when you have a large run-time count N, it gets a lot more inefficient. First you have to
    decide what SIMD extensions you are going to require from the target,
    and thus how wide your SIMD instructions will be - say, M elements.
    Then you need to loop N / M times, doing M elements at a time. Then you need to handle the remaining N % M elements - possibly using smaller
    SIMD operations, possibly doing them with serial instructions (noting
    that there might be different details in the implementation of SIMD and serial instructions, especially for floating point).
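
    For illustration, a minimal sketch of that pattern using SSE2
    intrinsics (so M = 4 ints per vector); the function and the use of
    unaligned loads are just assumptions for the example:

      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stddef.h>

      /* dst[i] = a[i] + b[i]: the main loop does N/4 vector adds,
         the tail handles the remaining N%4 elements serially. */
      void add_arrays(int *dst, const int *a, const int *b, size_t n)
      {
          size_t i = 0;
          for (; i + 4 <= n; i += 4) {
              __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
              __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
              _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
          }
          for (; i < n; i++)
              dst[i] = a[i] + b[i];
      }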

    In many cases one can enlarge data structures to a multiple of the
    SIMD vector size (and align them properly). This requires some extra
    code, but not too much, and all of it is outside the inner loop. So,
    there is some waste due to unused elements, but it is rather small.

    Of course, there is still trouble due to different SIMD vector
    sizes and/or different SIMD instruction sets.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:08:58 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC), Thomas Koenig wrote:

    [LAPACK] is certainly in use by very many people, if indirectly, for
    example by Python or R.

    Certainly used by NumPy:

    ldo@theon:~> apt-cache depends python3-numpy
    python3-numpy
    ...
    |Depends: libblas3
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:11:37 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 23:11:38 +0300, Michael S wrote:

    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave
    uses relatively new implementations, and pretty sure that the same
    goes for Matlab.

    On my system, Octave uses exactly the same version of LAPACK as NumPy
    does:

    ldo@theon:~> apt-cache depends octave
    octave
    ...
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:17:19 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:20:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 22:22:32 -0000 (UTC), Waldek Hebisch wrote:

    In many cases one can enlarge data structures to a multiple of the SIMD
    vector size (and align them properly). This requires some extra code, but
    not too much, and all of it is outside the inner loop. So, there is some
    waste due to unused elements, but it is rather small.

    Of course, there is still trouble due to different SIMD vector sizes
    and/or different SIMD instruction sets.

    Just so long as you keep such optimized data structures *internal* to the program, and don’t make them part of any public interchange format!

    Interchange formats tend to outlive the original technological milieu they were created in, and decisions made for the sake of technical limitations
    of the time can end up looking rather ... anachronistic ... just a few
    years down the track.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Oct 19 01:56:03 2025
    From Newsgroup: comp.arch

    On 10/18/2025 10:16 AM, David Brown wrote:
    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a
    much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at
    tandem
    OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic.  So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range.  Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.


    Possible, but it is a question whether high bit depth would make much
    difference. We are still in a situation where HDMI usually sends 8 or
    sometimes 10 bits per channel, but displays are generally limited to 5
    or 6 bits (and may then dither on the display side).


    Then we have:
      Traditional LCD: Uses a fluorescent backlight;
      LED: Typically LCD + LED backlights;
      OLED: Panel itself uses LEDs;
        Typically much more expensive;
        Notoriously short lifespan.

    I have a display, LED+LCD tech, it has an HDR mode, but it isn't great.
    As noted, it seems like it mostly turns up the brightness and uses image processing wonk (which adds a bunch of artifacts).

    And, if I wanted 25% brighter, I could turn the brightness setting from
    40 to 50 or similar (checks, current settings being 40% brightness, 60% contrast).



    Then, we have HDR in 3D rendering which is, as noted, not usually about
    the monitor, but about using floating-point for rendering (typically
    with LDR for the final output).

    Often it still makes sense to use LDR for textures, but then HDR for the framebuffer (since the HDR is usually more a product of the lighting
    than the materials).

    Binary16 is plenty of precision for framebuffer.
    Though, often FP8U (E4.M4) is likely to still be acceptable.

    Where:
    E3.M5: Not really enough dynamic range.
    E4.M4: OK (Comparable to RGB555)
    E5.M3: Image quality is poor (worse than RGB555).

    We usually give up sign with smaller formats, assuming that any values
    which would go negative are clamped to 0, as it is harder in this case
    to justify spending a bit on being able to represent negative colors.

    For native Binary16, may as well allow negatives.
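
    For illustration, one plausible reading of an FP8U (E4.M4) value
    decoded to float; the bias of 7, the implicit leading 1, and the
    denormal handling are assumptions for the sketch, not necessarily how
    the formats above define it:

      #include <math.h>
      #include <stdint.h>

      /* Unsigned minifloat: 4 exponent bits, 4 mantissa bits, no sign.
         Assumed bias = 7; exponent 0 treated as denormal. */
      static float fp8u_e4m4_to_float(uint8_t v)
      {
          int e = (v >> 4) & 0x0F;
          int m = v & 0x0F;
          if (e == 0)
              return ldexpf((float)m, 1 - 7 - 4);      /* 0.m * 2^(1-bias) */
          return ldexpf((float)(16 + m), e - 7 - 4);   /* 1.m * 2^(e-bias) */
      }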



    There is a question of the best way to store HDR images:
    4x FP16: High quality, but expensive
    4x FP8U: More affordable, can do RGBA
    RGB8_E8: good for opaque images, works OK.
    RGB8_EA4: OK, non-standard.
    RGB9_E5: Good for opaque images
    RG11_B10: E5.M6 | E5.M5

    For files, currently ignoring EXR, but this is typically similar tech to
    the TGA format in most cases (raw floats, or maybe with RLE, very
    bulky). There are other options, but when I encountered EXR images in
    the past, they were being used basically like the TGA format.


    For a format like my UPIC design, could likely (in theory) handle
    components of up to around 14 bits. Problem becomes the range of
    quantizer values, where at high bit-depths an 8-bit quantization table
    value may be no longer sufficient.

    In this case, the limiting factor is that A-B needs to stay within int16
    range (both the internal buffers and coefficient encoding maxes out at
    int16 range).

    For T.81 JPEG, there are rarely used variants that have 10- and 12-bit
    components (where JPEG has a lot of the same basic issues).
    Though, a lot of what people assume are the limits of T.81 JPEG are
    actually the limits of JFIF.


    With either format, using 12 bits makes sense, as this isn't too far
    outside the range of the 8-bit quantization values (this mostly sets a
    limit on how low a quality 0% can achieve; though it likely does mean
    scaling the quantizer values by 8x vs whatever they would be for that
    quality level with LDR, and clamping them between 1 and 255).


    So, one possibility could be, say:
    Image can represent values as 12 bits: E5.M7

    Or, maybe allow negative components as well, likely in ones' complement
    form. Though, this would be unusual if using JPEG as a base, as
    implementations tend not to use negative components, even if nothing
    in the design of the format necessarily prevents them.

    Depending on needs, could be decoded as Binary16 or as one of the other formats.

    Though, another option is to just store the images with 8-bit E4.M4
    components (so, from the codec's POV, it is the same as with an LDR image).



    Then again, someone might want lossless Binary16, but my UPIC format
    couldn't do this as-is, since doing so would exceed current value ranges.

    I would likely need to hack the VLC scheme to allow for larger coefficients.

    As-is, table looks like (V prefix, extra bits, unsigned range):
       0/ 1,  0,     0..    1      2/ 3,  0,     2..    3
       4/ 5,  1,     4..    7      6/ 7,  2,     8..   15
       8/ 9,  3,    16..   31     10/11,  4,    32..   63
      12/13,  5,    64..  127     14/15,  6,   128..  255
      16/17,  7,   256..  511     18/19,  8,   512.. 1023
      20/21,  9,  1024.. 2047     22/23, 10,  2048.. 4095
      24/25, 11,  4096.. 8191     26/27, 12,  8192..16383
      28/29, 13, 16384..32767     30/31, 14, 32768..65535

    So, with the zigzag folding, this expresses a 16-bit range.
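
    For reference, the usual zigzag fold/unfold between signed and
    unsigned values looks like the sketch below (one common convention;
    the exact mapping used here may differ):

      #include <stdint.h>

      /* 0,-1,1,-2,2,... <-> 0,1,2,3,4,... */
      static uint16_t zz_fold(int16_t v)
      {
          return (uint16_t)((v < 0) ? (-2 * (int)v - 1) : (2 * (int)v));
      }
      static int16_t zz_unfold(uint16_t u)
      {
          return (int16_t)((u & 1) ? -(int)((u + 1) >> 1) : (int)(u >> 1));
      }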

    Both the Block-Haar and RCT effectively cost 1 bit of dynamic range,
    meaning that, as-is, the widest allowed component is 14 bits (signed
    range).

    Though, one possibility would be hacking the upper end of the table (not otherwise used for LDR images) to use a steeper step with a 16-bit
    components range, say:
    24, 12, 4096.. 8191
    25, 13, 8192.. 16383
    26, 14, 16384.. 32767
    27, 15, 32768.. 65535
    28, 16, 65536.. 131071
    29, 17, 131072.. 262143
    30, 18, 262144.. 524287
    31, 19, 524288..1048575

    Which (if using 32-bits for transform coefficients) would exceed the
    dynamic range needed for 16-bit coefficients (roughly +/- 262144 if unbalanced).

    Might need to define a special case for 16-bit quantization tables to
    allow for effective lossy compression though. Most naive option is that,
    if the quantization table has 128 bytes of payload (vs 64) it is assumed
    to use 16-bit components.


    Well, and then one can debate whether RCT, Haar, etc, are still the best options. Well, and (if 12 bit components were used), how the VLC scheme
    would be understood (or if Binary16 would effectively preclude such a
    12-bit encoding scheme as redundant).


    May or may not have a use-case for such a thing, TBD.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Oct 19 07:55:57 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:
    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to chose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions,
    if not changes. But if you are right that LAPACK implementation was
    not updated in decades, then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Does Python (numpy and scipy, I suppose) or R linked against
    implementation of LAPACK from 40 or 30 years ago, as suggested by Mitch?

    No, they don't (as I learned). They would cut themselves off
    from all the improvements and bug fixes since then.

    Somehow, I don't believe it.
    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses relatively new implementations, and pretty sure that the same goes
    for Matlab.

    I would be surprised otherwise.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.

    For the same reason, I implemented unrolling of MATMUL for small
    matrices in gfortran a few years ago. If all you are doing are
    small matrices (especially of constant size), the compiler can
    do a better job from a straight loop. By the time the optimized
    matmul routines have started up their machinery, the calculation
    is already done.

    I had to be careful about benchmarking, though. I had to hide from the
    compiler the fact that I was not actually using the results;
    otherwise I got extremely fast execution times for what was
    essentially a no-op. My standard method now is to select a pair
    of array indices where the compiler cannot see them (read from
    a string) and then write out a single element at that position,
    also to a string.
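
    The same trick, sketched in C rather than Fortran (hypothetical names;
    the point is only that the indices and the single output element pass
    through strings the optimizer cannot see through):

      #include <stdio.h>

      #define N 4
      static double a[N][N], b[N][N], c[N][N];
      static char idx_str[] = "2 3";  /* indices the compiler cannot constant-fold */
      static char out_str[64];        /* sink that keeps the result observably used */

      static void matmul(void)        /* the code under test */
      {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  double s = 0.0;
                  for (int k = 0; k < N; k++)
                      s += a[i][k] * b[k][j];
                  c[i][j] = s;
              }
      }

      void bench_once(void)
      {
          int i, j;
          sscanf(idx_str, "%d %d", &i, &j);
          matmul();
          snprintf(out_str, sizeof out_str, "%g", c[i][j]);
      }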
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sun Oct 19 16:52:12 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a
    picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    My guess: Well below 0.1% unless they get told what it is.
    It was not obvious to me, and I have sat on the Cray bench several
    times, both in Trondheim (in active use at the time) and in the
    Computer History Museum in Silicon Valley many years later. (Maybe the
    latter is a faulty recollection, and I only got to look at it at that
    time? It was during a private showing of the collection.)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:31:50 2025
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.

    I do not understand why monitors would go beyond 9 bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.

    My point was that there is a physical limit on how closely one can
    illuminate a colored pixel--and that limit is around 11-bits. Just
    like there is a limit on how good one can make an A/D converter which
    is around 22-bits.

    I did not imply that a person could SEE that fine a granularity, just
    that one could build a screen that had that fine a granularity.

    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:37:03 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:

    LAPAC has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be changes to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which makes sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    FFT is sensitive to NOT using FMAC--that is, the error across
    butterflies is lower with FMUL, FMUL and FADD than with FMUL, FMAC.
    This has to do with distributing the error evenly, whereas FMAC
    makes one of the calculations better.

    (According to the Fortran standard, the compiler has to honor
    parentheses).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:42:35 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sun Oct 19 18:07:10 2025
    From Newsgroup: comp.arch

    On Sun, 19 Oct 2025 19:42:35 GMT, MitchAlsup
    <[email protected]d> wrote:


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???

    With repeated use hammers become brittle. A 30yo hammer is more likely
    to crack and/or chip than is a new one.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Mon Oct 20 08:57:42 2025
    From Newsgroup: comp.arch

    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed. But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Mon Oct 20 11:06:08 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.
    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.
    It worked on MMX/SSE/SSE2/Altivec.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 20 14:21:14 2025
    From Newsgroup: comp.arch

    On 10/20/2025 4:06 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which
    was a nice bonus - especially if you wanted more than CD quality audio.

    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.

    It worked on MMX/SSE/SSE2/Altivec.


    Yeah. Audio is fun...


    But MP3 and Vorbis have the odd property of either sounding really good
    (at high bitrates) or terrible (at lower bitrates, particularly if used
    for something with variable playback speed).

    Seems to be a general issue with audio codecs built from a similar sort
    of block-transform approach (such as MDCT or WHT).


    In some of my own experiments in a similar area, I had used WHT, but
    didn't get quite as good results. One problem seems to be that there
    is a big issue with frequencies near the block size, which
    results in nasty artifacts. The overlapping blocks and windowing of MDCT
    reduce this issue, but as noted, MDCT has a high computational cost (vs
    Haar or WHT).

    I have yet to come up with something in this category that gives
    satisfactory results (cheap, simple, effective, and passable quality).


    Can also note: ADPCM works OK.

    Can get better results IMO at bitrates lower than where MP3 or Vorbis
    are effective.

    Near the lower end:
    16kHz 2-bit ADPCM: OK, 32kbps
    11kHz 2-bit ADPCM: meh, 22kbps
    8kHz 4-bit ADPCM: Weak, 32kbps
    8kHz 2-bit ADPCM: poor, 16kbps


    Getting OK results at 2-bits/sample requires a different approach from
    what works well at 4 bits: rather than encoding one sample at a
    time, one usually needs to encode a block of samples at a time and
    then search the entire possibility space. Trying to encode samples one
    at a time gives poor results. This makes 2-bit encoding slower and more
    complicated than 4-bit encoding (but the decoder can still be fast).
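
    For illustration, a toy sketch of that block-search idea (the
    predictor and step-adaptation rule below are made-up stand-ins, not
    the actual codec): with 4 samples per block there are only 4^4 = 256
    candidate codes, so the whole space can be searched for the lowest
    squared error.

      #include <stdint.h>

      typedef struct { int pred; int step; } adpcm_state;

      /* Toy 2-bit decode: the code selects +/-step or +/-3*step, then the
         step size adapts up or down. */
      static int dec1(adpcm_state *s, int code)
      {
          static const int mul[4] = { 1, 3, -1, -3 };
          static const int adj[4] = { -1, 1, -1, 1 };
          s->pred += mul[code] * s->step;
          s->step += adj[code] * (s->step / 4 + 1);
          if (s->step < 1) s->step = 1;
          return s->pred;
      }

      /* Exhaustive search: pick the 4-code block that minimizes squared
         error against the next 4 input samples, then advance the state. */
      static unsigned encode_block4(adpcm_state *s, const int16_t *in)
      {
          unsigned best = 0;
          long long best_err = -1;
          for (unsigned cand = 0; cand < 256; cand++) {
              adpcm_state t = *s;
              long long err = 0;
              for (int i = 0; i < 4; i++) {
                  long long d = (long long)in[i] - dec1(&t, (cand >> (2 * i)) & 3);
                  err += d * d;
              }
              if (best_err < 0 || err < best_err) { best_err = err; best = cand; }
          }
          for (int i = 0; i < 4; i++)
              dec1(s, (best >> (2 * i)) & 3);
          return best;   /* one byte = four 2-bit codes */
      }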

    As noted, ADPCM proper does not work below 2 bits/sample.

    The added accuracy of 4-bit samples is not an advantage in this case
    since the reduction in sample rate has a more obvious negative impact here.


    After trying a few experiments, the current front-runner for going lower is: Encode a group of 8 or 16 samples as an 8-bit index into a table of
    patterns (such as groups of 2-bit ADPCM samples);
    This can achieve 1.0 or 0.5 bits/sample.

    Have yet to get anything with particularly acceptable audio quality though.

    Did end up resorting to using genetic algorithms for building the
    pattern tables for these experiments. I did previously experiment with
    an interpolation pattern table, but this gave worse results.


    One other line of experimentation was trying to fudge the ADPCM encoding algorithm to preferentially try to generate repeating patterns over
    novel ones with the aim of making it more compressible with LZ77.

    However, it was difficult to significantly improve LZ compressibility
    while still maintaining some semblance of audio quality. Neither
    byte-oriented LZ (e.g., LZ4) nor Deflate was particularly effective.


    Did note however that both LZMA and an LZMA style bitwise range encoder
    were much more effective (particularly with 12 or 16 bits of context).

    However, a range encoder is near the upper end of computational
    feasibility (and using a range encoder to squeeze bits out of ADPCM
    seems kinda absurd).


    One intermediate option seems to be a permutation transform. This can
    make the data more amenable to STF+AdRice or Huffman.

    Say, a 2-bit permutation transform is possible (though, in this case one
    can represent every permutation as a 5-bit finite state machine, stored
    as bytes in RAM for convenience). This does have the nice property that
    one can use an 8-bit table lookup for each context, which then produces 2
    bits of output at a time.

    Say:
    hist: 8 bits of history
    ival: input, 4x 2-bits
    oval: output, 4x 2-bits, permuted

    px1=permstate[hist];
    ix=((ival>>0)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=px2&3;

    px1=permstate[hist];
    ix=((ival>>2)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=oval|((px2&3)<<2);
    ...

    Decoding process is similar

    One downside of this is that they are still about as slow as using the
    bitwise range-coder would have been.


    Also, still doesn't really allow breaking into sub 10 kbps territory
    without a loss of quality. The use of pattern tables allows breaking
    into this territory with a similar loss of quality, and at a lower computational cost.

    Though, it seems possible that the permutation transform could be
    directly integrated with the ADPCM decoder (in effect turning it into
    more of a predictive transform); still wouldn't do much for speed, but
    alas. Would also still need an entropy coder to make use of this.



    One other route seems to be sinewave synthesis, say:
    Pick the top 4 sine waves via some strategy;
    Encode the frequency and amplitude (needs ~ 16 bits IME);
    Do this ~ 100-128 times per second.
    100Hz seems to be a lower limit for intelligibility.

    This needs ~ 6.4 to 8.2 kbps, or 7.2 to 9.2 kbps if one also includes a
    byte to encode a white noise intensity.
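
    For illustration, a minimal sketch of the decode/resynthesis side
    (the names, frame layout, and fixed 4 partials per frame are
    assumptions for the example, not the actual encoder described here):

      #include <math.h>
      #include <stddef.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      #define PARTIALS 4

      typedef struct {
          float freq[PARTIALS];   /* Hz */
          float amp[PARTIALS];    /* linear amplitude */
      } sw_frame;

      /* Render one frame (e.g. 16000/100 = 160 samples at a 100 Hz update
         rate) by summing the partials; 'phase' carries phase across frames
         so the waves stay continuous. */
      void sw_render_frame(const sw_frame *f, float *out, size_t nsamp,
                           float sample_rate, float phase[PARTIALS])
      {
          for (size_t i = 0; i < nsamp; i++)
              out[i] = 0.0f;
          for (int p = 0; p < PARTIALS; p++) {
              float step = 2.0f * (float)M_PI * f->freq[p] / sample_rate;
              for (size_t i = 0; i < nsamp; i++) {
                  out[i] += f->amp[p] * sinf(phase[p]);
                  phase[p] += step;
                  if (phase[p] > 2.0f * (float)M_PI)
                      phase[p] -= 2.0f * (float)M_PI;
              }
          }
      }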

    I had best results by taking the space from 2 to 8 kHz, dividing it
    into ~1/3 octaves, picking the strongest wave from each group, and then
    picking the top 4 strongest waves. It worked better for me to ignore
    lower frequencies (low frequencies seem to contain a lot of louder
    wave-forms which contribute little to intelligibility). In this case,
    waves between 2 and 4 kHz tend to dominate.

    Works OK for speech, but is poor for non-speech audio.
    Quality can be improved by more waves, but this quickly eats any bitrate advantage.
    Can note that while called sinewave synthesis, I also got good results
    with 3-state waves (-1, 0, 1), which are computationally preferable (wave-shape is: 1,0,-1,0).

    Can note that when used for non-speech, sinewave synthesis can have
    similar artifacts to low bitrate MP3.

    Could be pushed to lower update rates and maybe could make sense for
    basic songs (say, as a possible alternative to MIDI; which is arguably a somewhat more complex technology).

    Though, can note that for some older systems, sound effects were stored
    as variable-frequency square waves (say, for example, updating the
    square-wave frequency at 18 Hz or similar, with each frequency stored as
    a 16-bit clock-divider value or similar); along with some use of
    Delta-Sigma audio (where low-frequency delta-sigma sounds terrible).
    Neither are particularly good though.


    Though, for general audio storage (such as sound effects), some sort of
    ADPCM variant still seems preferable here.

    Though, I have still not found anything that is clearly beating 2-bit
    ADPCM for this (seemingly still a good option for sound effects).

    And, as noted, could still get good results with ADPCM + LZMA (or
    similar), main issue being the high computational cost of the latter.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Fri Oct 24 04:10:03 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> wrote:
    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:

    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way,
    rather than redesigning the whole kit and caboodle for tablets and
    phones, which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there are now enough well-trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that the LAPACK API was not updated in decades, although
    I'd expect that even at the API level there were at least small
    additions, if not changes. But if you are right that the LAPACK
    implementation was not updated in decades, then you can be sure that it
    is either not used by anybody or used by very few people.

    AFAICS at the logical level the interface stays the same. There is one
    significant change: in the old days you were on your own trying to
    interface to Lapack from C; now you can get a C interface.
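
    For example (a minimal sketch, assuming the LAPACKE interface from
    lapacke.h and linking with -llapacke -llapack; the matrix values are
    just for illustration):

        /* Solve A x = b for a 2x2 system via the C interface (dgesv). */
        #include <stdio.h>
        #include <lapacke.h>

        int main(void)
        {
            double a[4] = { 4.0, 1.0,     /* row-major 2x2 matrix A */
                            1.0, 3.0 };
            double b[2] = { 1.0, 2.0 };   /* RHS, overwritten with the solution */
            lapack_int ipiv[2];

            lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                            a, 2, ipiv, b, 1);
            if (info != 0) {
                fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
                return 1;
            }
            printf("x = [%g, %g]\n", b[0], b[1]);
            return 0;
        }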

    Concerning the implementation, AFAICS there are changes: some
    improvements to accuracy, some to speed. But the bulk of the code
    stays the same. There is a lot of work on the lower layer, that
    is, BLAS. But the idea of Lapack was that the higher-level algorithms
    are portable (also in time), while the lower-level building blocks
    must be adapted to the changing computing environment.

    There were attempts to replace Lapack by C++ templates; I do not
    see this gaining traction. There were attempts to extend Lapack
    to a larger class of matrices (mostly sparse matrices); apparently
    this is less popular than Lapack.

    There are attempts to automatically convert a simple high-level
    description of the operations into high-performance code. IIUC
    this has had some success with FFT and a few similar things, but
    is currently unable to replace Lapack.

    I would say the following: if you have a good algorithm, that
    algorithm may live long. Sometimes better things are invented
    later, but if not, then the old algorithm may be used for quite a
    long time. The goal of algorithmic languages was to make portable
    implementations of algorithms. That works reasonably well, but if
    one aims at the highest possible speed, then the needed tweaks are
    frequently machine-specific, so good performance may be nonportable.
    In the case of Lapack, it seems that there are no better algorithms
    now compared to the time when Lapack was created. The performance of
    Lapack on larger matrices depends mostly on the performance of
    BLAS, so there is a lot of current work on BLAS. IIUC Lapack
    routines are sometimes replaced by better-performing versions,
    but most of the time the gain is too small to justify the effort.

    Concerning "being used by few people": there are codes which
    are sold to a lot of users were performance or features
    matter a lot, such codes tend to evolve quickly. More
    typical is growth by adding new parts: old parts are kept
    with small changes, but new things are build on it (and
    new things independent of old thing are added). There is
    also popular "copy and mutate" approach: some parts are
    copied and them modified to provide different function
    (examples of this are drivers in an OS or new frontends
    in a compiler). However, this is partially weakness of
    programming language (it would be nicer to have clearly
    specified common part and concise specification of
    differences needed for various cases). Partly this is
    messy nature of real world. Lapack is a happly case
    when problem was quite well specified and language
    was reasonable fit for the problem. They use textual
    substitution to produce real and complex variants
    for single and double precision, so in principle
    language could do more. And certainly one could wish
    nicer and more compact description of the algorithms.
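
    (As a rough analogy in C -- Lapack does this in its Fortran sources,
    not with the C preprocessor, and the names below are made up -- one
    routine body can be stamped out per element type:)

        /* One routine body, expanded for single and double precision. */
        #define DEFINE_AXPY(NAME, T)                          \
            void NAME(int n, T alpha, const T *x, T *y)       \
            {                                                 \
                for (int i = 0; i < n; i++)                   \
                    y[i] += alpha * x[i];                     \
            }

        DEFINE_AXPY(saxpy_like, float)    /* single-precision variant */
        DEFINE_AXPY(daxpy_like, double)   /* double-precision variant */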
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Fri Oct 24 05:56:08 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <[email protected]> schrieb:

    AFAICS at the logical level the interface stays the same. There is one
    significant change: in the old days you were on your own trying to
    interface to Lapack from C; now you can get a C interface.

    And they got that wrong (by which I was personally bitten).
    See https://lwn.net/Articles/791393/ for a good write-up.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2