• On Cray arithmetic

    From Thomas Koenig@[email protected] to comp.arch on Sat Oct 11 10:32:22 2025
    From Newsgroup: comp.arch

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 11 19:36:44 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    I hope BGB reads this and takes it to heart.

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 00:28:16 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan ...

    No harm in reminding everyone of his legendary foreword to the
    Standard Apple Numerics manual, 2nd ed, of 1988. He had something
    suitably acerbic to say about a great number of different vendors’
    idea of floating-point arithmetic (including Cray).

    I posted one instance here
    <http://groups.google.com/group/comp.lang.python/msg/5aaf5dd86cb00651?hl=en>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 01:15:23 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    Anybody curious about what’s on pages 62-5 of the Apple Numerics Manual
    2nd ed can find a copy here <https://vintageapple.org/inside_o/>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Sun Oct 12 04:04:46 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Sun Oct 12 06:06:35 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 04:04:46 -0000 (UTC), John Savard wrote:

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@[email protected] to comp.arch on Mon Oct 13 07:23:21 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 06:06:35 +0000, Lawrence D’Oliveiro wrote:

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.

    That is a pity.

    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently quoted here.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon Oct 13 07:39:11 2025
    From Newsgroup: comp.arch

    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not only has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Mon Oct 13 09:05:18 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola) skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Mon Oct 13 13:12:12 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing
    in its entirety, that the “too hard” or “too obscure” parts were
    there for an important reason,

    It took many years for the *DEC* hardware designers to figure it out.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    to make programming that much easier, and should not be skipped.

    For many non-obvious parts of 754 it's true. For many other parts, esp.
    related to exceptions, it's false. That is, they should not be skipped,
    but the only reason for that is ease of documentation (just write "754"
    and you are done) and access to test vectors. These parts are not well
    thought out, do not make application programming any easier, and do not
    fit well into programming languages.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola), skimped a bit on hardware support.

    According to my understanding, Motorola suffered from being early
    adopters, similarly to Intel. They implemented 754 before the standard
    was finished and later on were in the difficult position of a conflict
    between compatibility with the standard vs. compatibility with previous
    generations.
    Moto is less forgivable than Intel, because as early adopters they were
    not nearly as early.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 13 12:30:33 2025
    From Newsgroup: comp.arch

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    Though, reading some stuff implies that a predecessor chip (the R4000)
    had a more functionally complete FPU. So, I guess it is also possible
    that the R4300 had a more limited FPU to make it cheaper for the
    embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
      Trying to hunt down some remaining bugs involving RVC in the CPU core;
        RVC is seemingly "the gift that keeps on giving" in this area.
        (The more dog-chewed the encoding, the harder it is to find bugs)
      Going from just:
        "Doing weak/crappy FP in hardware"
      To:
        "Trying to do less crappy FPU via software traps".
      A "mostly traps only" implementation of Binary128.
        Doesn't exactly match the 'Q' extension, but that is OK.
        I sorta suspect not many people are going to implement Q either.



    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts, so for
    longer running operations it is better to switch to a handler that can
    deal with interrupts (and, ATM, FDIV.Q and FSQRT.Q are kinda horridly
    slow; so, less like a TLB miss, and more like a page-fault...).

    The TestKern-related code is getting a little behind in my GitHub repo;
    the idea is that these parts will be posted when they are done.


    I had found/fixed one RVC bug since the last upload of the CPU core to
    GitHub, but more bugs remain and are still being hunted down.


    Progress is slow...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Oct 13 17:33:32 2025
    From Newsgroup: comp.arch


    Lawrence D’Oliveiro <[email protected]d> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly his favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Mon Oct 13 21:08:56 2025
    From Newsgroup: comp.arch

    On 13/10/2025 19:33, MitchAlsup wrote:

    Lawrence D’Oliveiro <[email protected]d> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    It does not make the programs more reliable - it makes them more
    consistent, predictable and portable. It does not make things easier
    for most code (support for NaNs and infinities can make some code
    easier, if mathematically nonsense results are a real possibility). But
    since consistency, predictability and portability are often very useful characteristics, full IEEE 754 compliance is a good thing for
    general-purpose processors.

    However, there are plenty of more niche situations where these are not
    vital, and where cost (die space, design costs, run-time power, etc.) is
    more important. Thus on small microcontrollers, it can be a better
    choice to skip support for the "obscure" stuff, and maybe even cut
    corners on things like rounding behaviour. The same applies to
    software floating point routines for devices that don't have hardware
    floating point at all.




    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You’ll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola) skimped a bit on hardware support.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Oct 13 21:53:33 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    And this is why FP wants high quality implementation.

    Though, reading some stuff, implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.

    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall like handler. Though, in
    this case would likely overlap it with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 02:27:46 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It’s about the principle of
    least surprise.
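
    [A minimal C sketch of graceful underflow (an illustration added here,
    not part of the original posting; it assumes IEEE binary64 doubles):
    repeatedly shrinking the smallest positive normal double keeps yielding
    nonzero subnormals rather than dropping straight to zero.]

        #include <stdio.h>
        #include <float.h>

        int main(void) {
            double x = DBL_MIN;        /* 2^-1022, smallest positive normal */
            for (int i = 0; i < 4; i++) {
                x /= 16.0;             /* step down into the subnormal range */
                printf("%a\n", x);     /* nonzero subnormals, not an abrupt 0 */
            }
            return 0;
        }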

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 02:36:50 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the “too hard” or “too obscure”
    parts were there for an important reason,

    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You’ll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.

    According to my understanding, Motorola suffered from being early
    adapters, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in difficult position of
    conflict between compatibility wits standard vs compatibility with
    previous generations. Moto is less forgivable than Intel, because
    they were early adapters not nearly as early.

    Let’s see, the Motorola 68881 came out in 1984 <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before <https://en.wikipedia.org/wiki/IEEE_754>.

    I would say Motorola had plenty of time to read the spec and get it
    right. But they didn’t. So Apple had to patch things up in its
    software implementation, introducing a mode where for example those
    last few inaccurate bits in transcendentals were fixed up in software, sacrificing some speed over the raw hardware to ensure consistent
    results with the (even slower) pure-software implementation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 13 22:38:18 2025
    From Newsgroup: comp.arch

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also
    had a branch delay slot, as well as using traps to implement some
    portions of the IEEE 754 standard... thus, presumably, being one of the
    architectures to inspire the piece about bad architectures from Linus
    Torvalds recently quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding, with emulators then running instead on hardware with
    denormals and RNE.

    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like moving
    platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    Or, is the argument here that sticking with a weaker, not-quite-IEEE FPU
    is preferable to using trap handlers?



    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Also on RISC-V, it is more expensive to implement 128-bit arithmetic, so
    the actual cost might be lower.

    The main deviation from the Q extension is that it will use register
    pairs rather than 128-bit registers. I suspect that 128-bit registers
    would likely cause more problems for software built to assume RV64G
    than breaking the spec and using pairs does.

    Or, if the proper Q extension were supported, it would make more sense
    in the context of RV128, so that XLEN==FLEN. Otherwise, Q on RV64 would
    break the ability to move values between FPRs and GPRs (in the RV spec,
    they note the assumption that in this configuration, moves between FPRs
    and GPRs would be done via memory loads and stores). This would suck,
    and actively make the FPU worse than sticking primarily with the D
    extension and doing something nonstandard.


    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.


    Will have to test this more to find out.

    But, at least in the case of Binary128, the operations themselves are
    likely to be slow enough to partly offset the trap-handling and
    instruction decoding overheads.



    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall like handler. Though, in
    this case would likely overlap it with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.

    Yeah, no re-entrant interrupts here.

    For a longer-running operation, it is mostly needed to handle things
    with a context switch into supervisor mode. Can't use the normal
    SYSCALL handler though, as it itself may have been the source of the
    trap. So, Page-Fault needs its own handler task.


    It is likely that re-entrant interrupts would require a different and
    likely more complex mechanism.

    Well, and/or rework things at the compiler level so that the ISR proper
    is only used to implement a transition into supervisor mode (or from supervisor-mode back to usermode); and then fake something more like the
    x86 style interrupt handling.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Tue Oct 14 01:53:17 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the “too hard” or “too obscure”
    parts were there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You’ll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.
    According to my understanding, Motorola suffered from being early
    adapters, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in difficult position of
    conflict between compatibility wits standard vs compatibility with
    previous generations. Moto is less forgivable than Intel, because
    they were early adapters not nearly as early.

    Let’s see, the Motorola 68881 came out in 1984 <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before <https://en.wikipedia.org/wiki/IEEE_754>.

    Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended
    for the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    https://en.wikipedia.org/wiki/Weitek

    Unfortunately not all the chip documents are on bitsavers

    http://www.bitsavers.org/components/weitek/dataSheets/

    but the WTL-1164_1165 PDF from 1986 says

    FULL 32-BIT AND 64-BIT FLOATING POINT
    FORMAT AND OPERATIONS, CONFORMING TO
    THE IEEE STANDARD FOR FLOATING POINT ARITHMETIC

    2.38 MFlops (420 ns) 32-bit add/subtract/convert and compare
    1.85 MFlops (540 ns) 64-bit add/subtract/convert and compare
    2.38 MFlops (420 ns) 32-bit multiply
    1.67 MFlops (600 ns) 64-bit multiply
    0.52 MFlops (1.92 µs) 32-bit divide
    0.26 MFlops (3.78 µs) 64-bit divide
    Up to 3.33 MFlops (300 ns) for pipelined operations
    Up to 3.33 MFlops (300 ns) for chained operations
    32-bit data input or 32-bit data output operation every 60 ns


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 08:30:44 2025
    From Newsgroup: comp.arch

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.
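
    [An illustration of that "viral" property, added here and not part of
    the original posting; rescale() is a made-up helper, and default IEEE
    semantics (no -ffast-math) are assumed:]

        #include <stdio.h>
        #include <math.h>

        /* No NaN checks anywhere in the middle of the calculation. */
        static double rescale(double x, double lo, double hi) {
            return (x - lo) / (hi - lo);
        }

        int main(void) {
            printf("%f\n", rescale(2.5, 0.0, 10.0));  /* 0.250000 */
            printf("%f\n", rescale(NAN, 0.0, 10.0));  /* nan: the bad input
                                                         propagates through */
            return 0;
        }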

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@[email protected] to comp.arch on Tue Oct 14 06:56:46 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 01:53:17 -0400, EricP wrote:

    Circa 1981 there was the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended for
    the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    Weitek add-on cards, I think mainly the early ones, were popular with
    more hard-core power users of Lotus 1-2-3. Remember, that was the
    “killer app” that prompted a lot of people to buy the IBM PC (and
    compatibles) in the first place. Some of them must have been doing some
    serious number-crunching, such that floating-point speed became a real
    issue.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 07:51:09 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.
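
    [A two-line C illustration of that pitfall, added here rather than
    taken from the posting; it assumes default IEEE comparison semantics,
    i.e. no -ffast-math:]

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double a = NAN, b = 1.0;
            printf("a < b    : %d\n", a < b);       /* 0: NaN compares false */
            printf("!(a >= b): %d\n", !(a >= b));   /* 1: not the same test */
            return 0;
        }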

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.
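
    [A concrete instance of that example, added as an illustration (it is
    not from the posting, and it is x86-64/SSE specific so that FTZ can be
    switched on at run time via <xmmintrin.h>): take a = DBL_MIN and b one
    ulp above it; a<b is true either way, but a-b is a tiny negative
    subnormal, so under flush-to-zero it becomes -0 and a-b<0 turns false.]

        #include <stdio.h>
        #include <float.h>
        #include <math.h>
        #include <xmmintrin.h>

        int main(void) {
            volatile double a = DBL_MIN;                 /* smallest normal */
            volatile double b = nextafter(DBL_MIN, 1.0); /* one ulp above a */

            printf("a<b=%d a-b<0=%d\n", a < b, (a - b) < 0.0);  /* 1 1 */

            _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); /* flush subnormal
                                                           results to zero */
            printf("a<b=%d a-b<0=%d\n", a < b, (a - b) < 0.0);  /* 1 0 */
            return 0;
        }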

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 10:47:56 2025
    From Newsgroup: comp.arch

    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer. But as long as you are aware of the
    possibility and consequences of NaNs, they can be useful.


    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?


    I'm sure there are a number of interesting ways to model this kind of
    thing, in a programming language that supported it. NaN's in floating
    point are somewhat akin to error values in C++ std::expected<>, or empty std::optional<> types, or like "result" types found in many languages.

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more useful result. And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold? I do not know the details here - it is simply
    not something that turns up in the kind of coding I do. (In my line of
    work, floating point values and expression results are always "normal",
    if that is the correct term. I can always use gcc's "-ffast-math", and
    I think a lot of real-world floating point code could do so - but I
    fully appreciate that does not apply to all code.)

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?


    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway before everything falls apart, but only a tiny amount. Doing it right
    is going to cost you, in development time or runtime efficiency, but
    that's better than getting the wrong answers quickly!



    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 11:26:10 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it
    produces a NaN instead. I doubt that many people would find that
    useful, however.
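
    [One concrete way to do that, added as a sketch (glibc/x86 specific,
    not from the posting): feenableexcept() unmasks the underflow
    exception so it traps instead of merely raising a flag.]

        #define _GNU_SOURCE
        #include <fenv.h>
        #include <float.h>
        #include <stdio.h>

        int main(void) {
            feenableexcept(FE_UNDERFLOW);  /* trap rather than set a flag */
            volatile double x = DBL_MIN;
            x /= 3.0;                      /* tiny, inexact result: SIGFPE */
            printf("not reached: %a\n", x);
            return 0;
        }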


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals make it possible to represent positive numbers down to
    2^-1074. So, with denormal numbers, the absolute value of your dividend
    must be less than 2^-50 to produce a non-infinite result where
    flush-to-zero would have produced an infinity.
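
    [Worked numbers for that bound, added as an illustration (assuming IEEE
    binary64, not part of the posting): the largest finite double is just
    under 2^1024, so dividing by the smallest positive subnormal, 2^-1074,
    stays finite only when the dividend is below roughly 2^-50.]

        #include <stdio.h>

        int main(void) {
            double tiny = 0x1p-1074;         /* smallest positive subnormal */
            printf("%g\n", 0x1p-51 / tiny);  /* about 2^1023: still finite */
            printf("%g\n", 1.0     / tiny);  /* 2^1074: overflows to inf */
            return 0;
        }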

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
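
    [Case 5) in two lines of C, added as an illustration (assumes IEEE
    semantics; not from the posting):]

        #include <stdio.h>
        #include <math.h>

        int main(void) {
            double inf = INFINITY;
            printf("inf<=inf  : %d\n", inf <= inf);          /* 1 */
            printf("inf-inf<=0: %d\n", (inf - inf) <= 0.0);  /* 0: inf-inf
                                                                is NaN */
            return 0;
        }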

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is
    debugged).

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or
    chemist.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    are the more important benefit. And if you take a different branch of
    an IF-statement because you have a flush-to-zero FPU, you can easily
    get a completely bogus result when the denormal case would still have
    had enough accuracy by far.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Oct 14 15:37:10 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It’s about the principle of
    least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating
    arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    Subnormal is critical for stability of zero-seeking algorithms, i.e. a
    lot of standard algorithmic building blocks.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Oct 14 15:42:45 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    You have just named the only common pitfall, where all comparisons
    against NaN shall return false.

    You can in fact define your own

    bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;
    }

    but this depends on the compiler/optimizer not messing up.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Tue Oct 14 17:29:40 2025
    From Newsgroup: comp.arch

    On 14/10/2025 13:26, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.


    NaNs don't materialise spontaneously either. They can be the result of intentionally using NaNs for missing data, or when your code is buggy
    and failing to calculate something reasonable. In either case, the
    surprise happens when someone passes the non-value to code that was not expecting to have to deal with it.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    Fair enough.


    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    I would disagree that this is the most common use for null pointers.
    But it certainly is /one/ use, and programmers should handle that usage correctly.

    So to sum up, there is a certain similarity, but there are also
    significant differences.


    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
    useful, however.


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.


    OK.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    The associative law holds fine with UB on overflow, as do things like
    adding a positive number to an integer makes it bigger. But this is all straying from the discussion on floating point, and I suspect that we'd
    just re-hash old disagreements rather than starting new and interesting
    ones :-)


    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
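    (Editorial sketch, not from Anton's post: case 5 in runnable C -- with both
    operands infinite, a<=b holds but a-b is NaN, so a-b<=0 does not.)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = INFINITY, b = INFINITY;

        printf("a <= b     : %d\n", a <= b);          /* 1 */
        printf("a - b <= 0 : %d\n", (a - b) <= 0.0);  /* 0, since a-b is NaN */
        return 0;
    }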


    Any kind of arithmetic with infinities is going to be awkward in some way!

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is debugged).


    Sure.

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or chemist.


    Fair enough.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    is the more important benefit. And if you take a different branch of
    an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
    enough accuracy by far.


    Well, I think that if your values are getting small enough to make denormal results, your code is at least questionable. I am not
    convinced that the equivalency you mentioned above is enough to make
    denormals worth the effort, but that may be just the kind of code I
    write. (And while I did study some of this stuff - numerical stability
    - in my mathematics degree, it was quite a long time ago.)

    Thanks for the comprehensive and educational information here. It is appreciated.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:31:00 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for
    the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.

    But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.

    Or, is the argument here that sticking with weaker not-quite IEEE FPU is preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.

    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.

    <snip>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:34:23 2025
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.

    But I find it harder to understand why denormals or subnormals are going
    to be useful.

    1/Big_Num does not underflow .............. completely.
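    (Editorial sketch, not part of Mitch's post, illustrating the point: the
    reciprocal of the largest double lands in the subnormal range instead of
    underflowing all the way to zero, so the information is not lost.)

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double big = DBL_MAX;       /* ~1.8e308 */
        double r   = 1.0 / big;     /* ~5.6e-309: below DBL_MIN, so subnormal */

        printf("1/DBL_MAX   = %g\n", r);
        printf("r > 0       : %d\n", r > 0.0);       /* 1 */
        printf("r < DBL_MIN : %d\n", r < DBL_MIN);   /* 1 */
        printf("big * r     = %g\n", big * r);       /* ~1 */
        return 0;
    }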

    Ultimately, your floating point code is approximating arithmetic on real numbers.

    Don't make me laugh.

    Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Oct 14 15:47:20 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    David Brown <[email protected]> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    Many ISAs and many programs have trouble in getting NaNs into the
    ELSE-clause. One cannot use deMorgan's Law to invert conditions in
    the presence of NaNs.

    We (Brain, Thomas and I) went to great pain to have FCMP deliver a
    bit pattern where one could invert the condition AND still deliver
    the NaN to the expected Clause. We threw in Ordered and Totally-
    Ordered at the same time, along with OpenCL FP CLASS() function.

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    The worst of all possible results is no information whatsoever.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant.

    Windows 7 and Office 2003 were good enough. That would have allowed
    zillions of programmers to go address the software crisis after being
    freed from projects that had become good enough not to need continual
    work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Oct 14 16:48:50 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> wrote:

    David Brown <[email protected]> posted:

    Ultimately, your floating point code is approximating
    arithmetic on real numbers.

    Don't make me laugh.

    Somebody (not me) recently added the following to the gcc bugzilla
    quip file:

    The "real" type in fortran is called "real" because the
    mathematician should not notice that it has finite decimal places
    and forget that one needs lengthy adaptations of the proofs for
    that....
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 16:46:03 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    The associative law holds fine with UB on overflow,

    With 32-bit ints:

    The result of (2000000000+2000000000)+(-2000000000) is undefined.

    The result of 2000000000+(2000000000+(-2000000000)) is 2000000000.

    So, the associative law does not hold.

    With -fwrapv both are defined to produce 2000000000, and the
    associative law holds because modulo arithmetic is associative.
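    (Editorial sketch, not from Anton's post: the wrapping semantics that
    -fwrapv guarantees, modelled here with unsigned arithmetic so the example
    is portable; wadd() is a made-up helper.  Both groupings of the sum above
    give the same wrapped result.)

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit wrapping addition, i.e. what -fwrapv promises for int.  The
       conversion back to int32_t is implementation-defined before C23, but
       is two's-complement wrapping on all mainstream compilers. */
    static int32_t wadd(int32_t a, int32_t b)
    {
        return (int32_t)((uint32_t)a + (uint32_t)b);
    }

    int main(void)
    {
        int32_t a = 2000000000, b = 2000000000, c = -2000000000;

        printf("%d\n", wadd(wadd(a, b), c));   /* 2000000000 */
        printf("%d\n", wadd(a, wadd(b, c)));   /* 2000000000 */
        return 0;
    }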

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Oct 14 17:26:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted:
    [...]
    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    That may be a good idea. You can write it in current languages as
    follows:

    if (a<b) {
        ...
    } else if (a>=b) {
        ...
    } else {
        ... NaN case ...
    }

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    I don't think anything about FCMP. What I wrote above is about
    programming languages. I.e., a<b would trap if a or b is a NaN, while lt_or_nan(a,b) would be true if a or b is a NaN, and
    lt_and_not_nan(a,b) would be false if a or b is a NaN. I think the
    IEEE754 people have better names for these comparisons, but am too
    lazy to look them up.
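    (Editorial sketch: one way to spell the hypothetical lt_or_nan and
    lt_and_not_nan in C99, using the standard isunordered() macro; the names
    are Anton's ad-hoc ones, not IEEE 754's.  A trapping a<b would in addition
    need the invalid-operation exception enabled, which is not shown.)

    #include <math.h>
    #include <stdbool.h>

    /* "Less than, or unordered": also true when either operand is a NaN. */
    static bool lt_or_nan(double a, double b)
    {
        return isunordered(a, b) || a < b;
    }

    /* "Less than, and ordered": false whenever either operand is a NaN.
       Under IEEE semantics this collapses to plain a < b; the explicit
       form just documents the intent. */
    static bool lt_and_not_nan(double a, double b)
    {
        return !isunordered(a, b) && a < b;
    }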

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    We can all wish for Kahan writing all FP code, but that only deepens
    the software crisis. Educating programmers is certainly a worthy
    undertaking, but providing a good foundation for them to build on
    helps those programmers as well as those that are less educated.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Oct 14 12:45:08 2025
    From Newsgroup: comp.arch

    On 10/14/2025 10:31 AM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <[email protected]d> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for
    the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding. Then, with emulators running instead on hardware with
    denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.


    This mostly applies to FMUL, but:
    I had already added a trap case for this as well.

    In the cases where all the low-order bits of either input are 0,
    then the low-order results would also be 0 and so are N/A (the final
    result would be the same either way).

    If both sets of low-order bits are non-zero, it can trap.
    This does mean that the software emulation will need to provide a full
    width result though.

    Checking for non-zero here being more cost-effective than actually doing
    a full width multiply.


    Also, RISC-V FMADD.D and similar are sorta also going to end up as traps
    due to the lack of single-rounded FMA (though had debated whether to
    have a separate control-flag for this to still allow non-slow FMADD.D
    and similar; but as-is, these will trap).



    For FADD:
    The shifted-right bits that fall off the bottom (of the slightly-wider internal mantissa) don't matter, since they were always being added to
    0, which can't generate any carry.

    For FSUB, it may matter, but more in the sense that one can check
    whether the "fell off the bottom" part had non-zero bits and use this to adjust the carry-in part of the subtractor (since non-zero bits would
    absorb the carry-propagation of adding 1 to the bottom of a
    theoretically arbitrarily wide twos complement negation).

    So, in theory, can be dealt with in hardware to still give an exact result.


    There are still some sub-ULP bits, so the complaint about the lack of a
    guard bit doesn't really apply.


    Also apparently the Cray used a non-normalized floating point format (no hidden bit), which was odd (and could create its own issues).

    Though, potentially a non-normalized format with lax normalization could
    allow for cheaper re-normalization (even if it could require
    re-normalization logic for FMUL). Though, for such a format, there is
    the possibility that someone could make re-normalization be its own instruction (allowing for an FPU with less latency).


    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff, implies a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.
    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.


    As noted above, I was already working on this.


    Or, is the argument here that sticking with weaker not-quite IEEE FPU is
    preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.



    OK.

    As can be noted, for scalar operations I consider there to be a limit as
    to how bad is "acceptable".

    For SIMD operations, it is a little looser.
    For example, the ability to operate on integer values and get exact
    results is basically required for scalar operations, but optional for SIMD.

    Though, in this case it is a case of both Quake and also some JavaScript
    VMs relying on the ability to express integer values as floating-point
    numbers and use them in calculations as such (so, for example, if the operations don't give exact results then the programs break).


    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.


    OK.


    Can note that in my looking, it seems like:
    Pretty much none of the ASIC implementations support the Q extension;
    It is not required in any of the mainline profiles;
    Implementing Q proper would have non-zero impact on RV64G:
    The differences between F+D and F+D+Q being non-zero.
    Whereas, "fudging it" can retain strict compatibility with D.
    Where, people actually use 'D'.

    There is a non-zero amount of code using "long double", but in this case
    the bigger issue is more the code footprint of the associated
    long-double math functions rather than performance (say, if someone uses "cosl()" or similar).

    Still not ideal, as (with my existing ISA extensions) there is still no single-instruction way to load a 64-bit value into an FPR.

    But, could at least reduce it from 11 (44 bytes) instructions to 3 (20
    bytes; "LI-Imm33; SHORI-Imm32; FMV.D.X"). This still means 40 bytes to
    load a full-width Binary128 literal.
    Loading the same literal would need 24 bytes in XG3.
    And, an unrolled Taylor expansion uses a lot of them.

    With Q proper? Only option would be to use memory loads here.
    Like, the C math functions are annoyingly bulky in this case.


    Meanwhile, elsewhere I saw a mention that apparently, to deal with RISC-V
    fragmentation issues, there is now work underway on a mechanism to allow
    modification of the RISC-V instruction listings in GCC without needing
    to modify the code in GCC proper each time (basically hot injecting
    stuff into the instruction listing and similar).

    As apparently having everyone trying to modify the ISA every which way
    is making a bit of an awful mess of things.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 03:45:31 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The “crisis” was supposed to have to do with the shortage of programmers to
    write all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the “crisis” exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 03:47:14 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 15:47:20 GMT, MitchAlsup wrote:

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    All the good languages have IEEE754 compliant arithmetic libraries,
    including type queries for things like isnan().

    E.g. <https://docs.python.org/3/library/math.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Oct 15 05:55:40 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don’t think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 12:41:28 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT
    [email protected] (Anton Ertl) wrote:
    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don’t think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.
    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 12:36:17 2025
    From Newsgroup: comp.arch

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 616 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of
    magnitude.)
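    (Editorial sketch: the same limits straight from <float.h>; DBL_TRUE_MIN
    is the C11 name for the smallest subnormal, hence the guard.)

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("DBL_MAX      = %g\n", DBL_MAX);       /* ~1.8e308 */
        printf("DBL_MIN      = %g\n", DBL_MIN);       /* ~2.2e-308, smallest normal */
    #ifdef DBL_TRUE_MIN
        printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN);  /* ~4.9e-324, smallest subnormal */
    #endif
        printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);  /* 53 mantissa bits */
        return 0;
    }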

    Denormals let you squeeze a bit more at the lower end here - another 16
    orders of magnitude - at the cost of rapidly decreasing precision. They
    don't stop the inevitable approximation to zero, they just delay it a
    little.

    I am still at a loss to understand how this is going to be useful - when
    will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially
    when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and there
    you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even there, denormals are not going to give you more than a tiny amount extra.

    (There are, of course, mathematical problems which deal with values or precisions far outside anything of relevance to the physical world, but
    if you are dealing with those kinds of tasks then IEEE floating point is
    not going to do the job anyway.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Oct 15 12:54:30 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    David Brown <[email protected]> posted:

    On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the “too hard” or “too obscure” parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of “easier”.

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren’t they called “subnormals” now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It’s about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.
    That was true under 754-2008 but we fixed it for 2019: All NaNs
    propagate through the new min/max definitions. The old still exist of
    course, but they are deprecated.
    The point that made it obvious to everyone was that under the 2008
    definition an SNaN would always propagate, but be converted to a QNaN,
    but a QNaN could silently disappear as shown above.
    What this meant was that for any kind of vector reduction, the final
    result could be the NaN or any of the other input values, depending upon the order of the individual comparisons!
    I was one of the proponents who pushed this change through, but I will
    say that after we showed some of the most surprising results, everyone
    agreed to fix it. Having NaN maximally sticky is also definitely in the
    spirit of the entire 754 standard:
    The only operations that do not propagate NaN are those that explicitly
    handle this case, or those that don't return a floating point value.
    Having all compares return 'false' is an example of the latter.
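    (Editorial sketch, not from Terje's post: C's fmax() implements the
    754-2008 maxNum rule described above, so a quiet NaN silently disappears;
    newer C revisions add a separate fmaximum family for the 2019 semantics.)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 2.5, q = NAN;

        /* maxNum treats a quiet NaN as "missing data": the number wins. */
        printf("fmax(x, NaN) = %g\n", fmax(x, q));   /* 2.5 */
        printf("fmax(NaN, x) = %g\n", fmax(q, x));   /* 2.5 */
        return 0;
    }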
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Oct 15 13:07:01 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration.  Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y.  Then - again in the mathematical real domain - the operation is carried out.  Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent.  For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude.  (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16 orders of magnitude - at the cost of rapidly decreasing precision.  They don't stop the inevitable approximation to zero, they just delay it a little.

    I am still at a loss to understand how this is going to be useful - when will that small extra margin near zero actually make a difference, in
    the real world, with real values?  When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially when these smaller numbers have lower precision?
    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.
    I.e. they differ by exactly one ulp.
    As I noted, I have not been bitten by this particular issue, one of the
    reasons being that I tend to not write infinite loops inside functions,
    instead I'll pre-calculate how many (typically NR) iterations should be needed.
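    (Editorial sketch of that style; the reciprocal example, the helper name
    nr_recip and the seed-accuracy numbers are illustrative assumptions, not
    Terje's code.  Newton-Raphson roughly doubles the number of correct bits
    per step, so the iteration count can be fixed up front from the accuracy
    of the initial guess.)

    #include <stdio.h>

    /* Reciprocal via Newton-Raphson with a fixed iteration count.  Even an
       8-bit seed needs only 3 doublings (8 -> 16 -> 32 -> 64 > 53 bits);
       the single-precision division below is just a convenient seed. */
    static double nr_recip(double d)
    {
        double x = 1.0f / (float)d;       /* ~24-bit seed */
        for (int i = 0; i < 3; i++)
            x = x * (2.0 - d * x);        /* classic NR step for 1/d */
        return x;
    }

    int main(void)
    {
        printf("%.17g\n", nr_recip(3.0));   /* ~0.33333333333333331 */
        return 0;
    }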
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 16:50:13 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <[email protected]> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no
    change of sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algo can be called
    half-Newton.








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Oct 15 17:46:21 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 13:07:01 +0200
    Terje Mathisen <[email protected]> wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough
    to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you,
    Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison,
    the size of the universe measured in Planck lengths is only about
    61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here -
    another 16 orders of magnitude - at the cost of rapidly decreasing
    precision. They don't stop the inevitable approximation to zero,
    they just delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are
    using your Newton-Raphson iteration to find your function's zeros,
    what are the circumstances in which you can get a more useful end
    result if you continue to 10 ^ -324 instead of treating 10 ^ -308
    as zero - especially when these smaller numbers have lower
    precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least
    some zero-seeking algorithms will stabilize on an exact value, if and
    only if you have subnormals, otherwise it is possible to wobble back
    & forth between two neighboring results.

    I.e. they differ by exactly one ulp.

    As I noted, I have not been bitten by this particular issue, one of
    the reasons being that I tend to not write infinite loops inside
    functions, instead I'll pre-calculate how many (typically NR)
    iterations should be needed.

    Terje
    It does not sound right to me. With Newton-like iterations, oscillations
    by 1 ULP could happen even with subnormals. They should be taken care of
    by properly written exit conditions.
    What could happen without subnormals are oscillations by *more* than 1
    ULP, sometimes much more.
    Also, in the absence of subnormals one can suffer divisions by zero in
    code like below:

    while (fb > fa) {
        a -= b*fa/(fb - fa);
        ...
    }
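    (Editorial sketch, x86-specific and not from Michael's post: flipping the
    SSE flush-to-zero / denormals-are-zero bits reproduces exactly this
    failure mode -- fb > fa still holds while fb - fa compares equal to zero,
    so the division above blows up.  Assumes SSE floating point, the default
    on x86-64.)

    #include <float.h>
    #include <math.h>
    #include <stdio.h>
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    int main(void)
    {
        volatile double fb = nextafter(DBL_MIN, 1.0);  /* tiny normal */
        volatile double fa = DBL_MIN;                  /* one ulp smaller */

        printf("subnormals on : fb > fa = %d, fb - fa = %g\n",
               fb > fa, fb - fa);                      /* 1, ~4.9e-324 */

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        printf("FTZ/DAZ on    : fb > fa = %d, fb - fa = %g\n",
               fb > fa, fb - fa);                      /* 1, 0 */
        return 0;
    }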
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 16:53:33 2025
    From Newsgroup: comp.arch

    On 15/10/2025 13:07, Terje Mathisen wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration.  Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.  Then
    - again in the mathematical real domain - the operation is carried
    out.  Then the result is truncated or rounded to fit back within the
    mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent.  For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just delay
    it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values?  When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I.e. they differ by exactly one ulp.

    I have no problems believing that this can occur on occasion. No matter
    what range you pick for your floating point formats, or what precision
    you pick, you will always be able to find examples of this kind of
    algorithm that home in on the right value with the format you have
    chosen but would fail with just one bit less. I just don't think that
    such pathological examples mean that subnormals are important.

    But if such cases occur regularly in real-world calculations, not just artificial examples, then it's a different matter.


    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions, instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Wed Oct 15 17:52:48 2025
    From Newsgroup: comp.arch

    On 15/10/2025 15:50, Michael S wrote:
    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <[email protected]> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <[email protected]> writes:

    Well, I think that if your values are getting small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y, then x - y > 0.
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.
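
    A minimal C illustration of that invariant right at the bottom of the
    normal range (nextafter() is standard <math.h>; with gradual underflow
    the difference is the subnormal DBL_TRUE_MIN, whereas a flush-to-zero
    implementation would deliver 0 even though x > y):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        double y = DBL_MIN;            /* smallest positive normal double */
        double x = nextafter(y, 1.0);  /* next representable value up     */

        printf("x > y     : %d\n", x > y);          /* 1                  */
        printf("x - y     : %g\n", x - y);          /* subnormal, not 0   */
        printf("x - y > 0 : %d\n", (x - y) > 0.0);  /* 1 with subnormals  */
        return 0;
    }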


    I can appreciate that you can have x > y, but with such small x and y
    and such close values that (x - y) is a subnormal - thus without
    subnormals, (x - y) would be 0.

    Perhaps I am being obtuse, but I don't see how you would write a Newton-Raphson algorithm that would fail to converge, or fail to stop,
    just because you don't have subnormals. Could you give a very rough
    outline of such problematic code?


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no
    change of sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algorithm could be
    called half-Newton.


    I was perhaps that age when I first came across Newton-Raphson in a
    maths book, and wrote an implementation for it on a computer. That was
    in BBC Basic, and I'm pretty sure that the floating point type there was
    not IEEE compatible, and did not support such fancy stuff as subnormals!
    But I am also very sure I did not push the program to more difficult examples. (But it did show nice graphic illustrations of what it was
    doing.)

    It was also around then that I wrote a program for matrix inversion, and discovered the joys of numeric instability, and thus the need for care
    when picking the order for Gaussian elimination.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Wed Oct 15 13:22:01 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 15 Oct 2025 05:55:40 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.
    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.
    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the
    difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).
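
    The same control-to-data-dependency idea can also be expressed purely in
    software; a hedged C sketch of generic index masking (the helper name is
    illustrative, not any particular library's API):

    #include <stddef.h>

    /* Returns idx when idx < limit, 0 otherwise, using arithmetic rather
       than a branch.  The load below therefore has a *data* dependency on
       the comparison, so a mispredicted bounds check cannot steer it to an
       attacker-chosen address.  (A sufficiently clever compiler could turn
       this back into a branch, so real deployments pin it down with inline
       asm or a compiler barrier.) */
    static inline size_t mask_index(size_t idx, size_t limit)
    {
        size_t in_bounds = (size_t)(idx < limit);   /* 1 or 0   */
        return idx & (0 - in_bounds);               /* idx or 0 */
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        if (idx >= limit)
            return -1;                       /* architectural check    */
        return base[mask_index(idx, limit)]; /* speculation-safe index */
    }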

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:09:27 2025
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    ----------------------------

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost?

    Most people would say:: "When it adds performance" AND the compiler
    can use it. Some would add: "from unmodified source code"; but I
    am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Printf-family "closes more of the gap" than EDIT ever could. And there
    is a whole suite of things better off left in subroutines than being
    raised into Instructions.

    Unfortunately, elementary FP functions are no longer in that category.
    When one can perform SIN(x) along with argument reduction and polynomial calculation in the cycle time of FDIV, SIN() deserves to be a first
    class member of the instruction set--especially if the HW cost is
    "not that much".

    On the other hand: things like polynomial evaluating instructions
    seem a bridge too far as you have to pick for all time 1 of {Horner,
    Estrin, Padé, Power Series, Clenshaw, ...} and at some point it
    becomes better to start using FFT-derived evaluation means.
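
    For reference, Horner's scheme (the first item in that list) is the
    simplest of those evaluation orders - one multiply-add per coefficient,
    processed from the highest degree down:

    /* Evaluate c[0] + c[1]*x + ... + c[n]*x^n using Horner's scheme. */
    double horner(const double *c, int n, double x)
    {
        double acc = c[n];
        for (int i = n - 1; i >= 0; i--)
            acc = acc * x + c[i];   /* maps directly onto FMA */
        return acc;
    }

    Estrin's scheme instead evaluates independent sub-polynomials in
    parallel to shorten the dependency chain, at the cost of a few extra
    multiplies - exactly the kind of choice such an instruction would be
    freezing for all time.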

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Arguably, the best thing to do here is to Trap on the creation of deNorms.
    At least then you can see them and do something about them at the algorithm level. {Gee Whiz Cap. Obvious: IEEE 754 already did this!}
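
    As an aside, a glibc/x86-specific sketch of asking for exactly that
    today: unmasking the IEEE underflow exception makes the creation of a
    tiny result trap (feenableexcept() is a glibc extension, and the trap
    arrives as SIGFPE):

    #define _GNU_SOURCE
    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        feenableexcept(FE_UNDERFLOW);  /* trap on underflow / deNorm creation */

        volatile double x = DBL_MIN;
        volatile double y = x / 4.0;   /* subnormal result -> SIGFPE here     */
        printf("%g\n", y);             /* not reached when trapping works     */
        return 0;
    }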

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    My 66000 is immune from Spectré; µA state is not updated until retire.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    We just don't have the smoking gun of a missing $1M-to-$1B to make it
    worth the effort to do something about it. But mark my words:: the vulnerability is being exploited ...

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:13:53 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    -------------------------------
    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    My 66000 allows an application to crap all over "the stack";
    but it does provide a means whereby "crapping all over the stack"
    does not allow the application to violate the contract between caller
    and callee. Once application performs a RET (or EXIT) control is returns
    to caller 1 instruction past calling point, and with the preserved
    registers preserved !

    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:28:52 2025
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    for Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 21:34:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more
    accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions, instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Almost always the right course of events.

    The W() function may be different. W( poly×(e^poly) ) = poly.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 21:37:42 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the “R” in “RISC”.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Wed Oct 15 21:42:32 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what “very-high-level” means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we’re seeing here is a downward creep, as those very-high-level languages (Python and JavaScript, particularly) are encroaching into the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels, otherwise we might as well use the latter.

    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    I’m not aware of such; feel free to give an example of some large Python project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 22:19:18 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    Yes, 28 YEARS after it was first put in !! it danged better be
    able !?! {yes argue about when}

    My point was that you don't put it in until you can see a performance
    advantage in the very next (or internal) compiler. {Where 'you' are
    the designers of that generation.}

    But the RISC-V folks still think Cray-style long vectors are better than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors
    (or vice versa)--they simply represent different ways of shooting
    yourself in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Oct 15 22:31:32 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D’Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don’t think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what “very-high-level” means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we’re seeing here is a downward creep, as those very-high-level languages (Python and JavaScript, particularly) are encroaching into the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels, otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL
    is acclaimed by the masses--it instantly falls into the trap where
    "users want more performance":: something the VHLL cannot provide
    until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote
    it in a high-performance language (FORTRAN or C) so it was usably fast.

    History has a way of repeating itself, when no-one remembers the past.

    In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning
    the whole kit and caboodle for tablets and phones, which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    I’m not aware of such; feel free to give an example of some large Python project, for example, which has exceeded its time and/or budget. The key point about using such a very-high-level language is you can do a lot in just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Thu Oct 16 05:44:04 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Thu Oct 16 05:57:34 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup wrote:

    On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D’Oliveiro wrote:

    What we’re seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into
    the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
    want more performance":: something the VHLL cannot provide until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote it
    in a high-performance language (FORTRAN or C) so it was usably fast.

    No, you didn’t. There is a Pareto rule in effect, in that the majority of the CPU time (say, 90%) is spent in a minority of the code (say, 10%). So having got your prototype working, and done suitable profiling to identify
    the bottlenecks, you concentrate on optimizing those bottlenecks, not on rewriting the whole app.

    Paul Graham (well-known LISP guru) described how the company he was with
    -- one of the early Dotcom startups -- wrote Orbitz, an airline
    reservation system, in LISP. But the most performance critical part was
    done in C++.

    Nowadays, with the popularity of Python, we already have lots of efficient lower-level toolkits to take care of common tasks, taking advantage of the versatility of the core Python language. For example, NumPy for handling serious number-crunching: you write a few lines of Python, to express a high-level operation that crunches a million sets of numbers in just a few seconds.

    Maybe it only took you a minute to come up with the line of code; maybe
    you will never need to run it again. Writing a program entirely in FORTRAN
    or C to perform the same operation might take an expert programmer an hour
    or two, say; in that time, the Python programmer could try out dozens of similar operations, maybe discard the results of three quarters of them,
    to narrow down the important information to be extracted from the raw
    data.

    That’s the kind of productivity gain we enjoy nowadays, on a routine
    basis, without making a big deal about it in news headlines. And that’s
    why we don’t talk about a “software crisis” any more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Thu Oct 16 09:04:23 2025
    From Newsgroup: comp.arch

    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are. I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in parallel in superscalar CPUs, rather than Itanium EPIC coding.
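
    A rough C-level sketch of why that is width-agnostic: the strip-mining
    loop below is what something like RISC-V's vsetvl does for you, with the
    hardware choosing the strip length. VLMAX here is purely illustrative:

    #include <stddef.h>

    #define VLMAX 64   /* stand-in for whatever the implementation provides */

    /* c[i] = a[i] + b[i]; the inner loop stands in for one vector add of
       "vl" elements, so the source does not change when the hardware gets
       wider - only VLMAX does. */
    void vadd(int *c, const int *a, const int *b, size_t n)
    {
        for (size_t i = 0; i < n; ) {
            size_t vl = n - i;
            if (vl > VLMAX)
                vl = VLMAX;
            for (size_t j = 0; j < vl; j++)
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }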


    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Thu Oct 16 07:00:58 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. NetSpectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre
    "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations
    disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Thu Oct 16 11:34:20 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
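
    A hedged C sketch of both points (rsqrt as the one building block, and a
    pre-calculated iteration count instead of an open-ended loop); the crude
    frexp()-based seed and the count of 8 iterations are my own choices:

    #include <math.h>
    #include <stdio.h>

    /* 1/sqrt(x) for finite x > 0 via Newton-Raphson on y ~ 1/sqrt(x):
       y <- y * (1.5 - 0.5*x*y*y).  The seed only gets the exponent roughly
       right, so 8 iterations (quadratic convergence) are ample for double. */
    static double my_rsqrt(double x)
    {
        int e;
        (void)frexp(x, &e);              /* x = m * 2^e, 0.5 <= m < 1 */
        double y = ldexp(1.0, -e / 2);   /* crude seed ~ 2^(-e/2)     */
        for (int i = 0; i < 8; i++)
            y = y * (1.5 - 0.5 * x * y * y);
        return y;
    }

    int main(void)
    {
        printf("my_rsqrt(2) = %.17g\n", my_rsqrt(2.0));
        printf("1/sqrt(2)   = %.17g\n", 1.0 / sqrt(2.0));
        /* sqrt(x) = x * rsqrt(x) and 1/x = rsqrt(x)*rsqrt(x), so the same
           helper also covers sqrt and reciprocal/fdiv-style uses. */
        return 0;
    }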

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Oct 16 10:24:37 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results
    this *design approach* should achieve.

    "Several factors indicate a Reduced Instruction Set Computer as a
    reasonable design alternative.
    ...
    Implementation Feasibility. A great deal depends on being able to fit
    an entire CPU design on a single chip.
    ...
    [EricP: reduced absolute amount of logic for a minimum implementation]

    Design Time. Design difficulty is a crucial factor in the success of
    VLSI computer.
    ...
    [EricP: reduced complexity leading to reduced design time]

    Speed. The ultimate test for cost-effectiveness is the speed at which an implementation executes a given algorithm. Better use of chip area and availability of newer technology through reduced debugging time contribute
    to the speed of the chip. A RISC potentially gains in speed merely from a simpler design.
    ...
    [EricP: reduced complexity and logic leads to reduced critical
    path lengths giving increased frequency.]

    Better use of chip area. If you have the area, why not implement the CISC?
    For a given chip area there are many tradeoffs for what can be realized.
    We feel that the area gained back by designing a RISC architecture rather
    than a CISC architecture can be used to make the RISC even more attractive
    than the CISC. ... When the CISC becomes realizable on a single chip,
    the RISC will have the silicon area to use pipelining techniques;
    when the CISC gets pipelining the RISC will have on chip caches, etc.
    ...
    [EricP: reduced waste on dragging around architectural boat anchors]

    The experience we have from compilers suggests that the burden on compiler writers is eased when the instruction set is simple and uniform.
    ...
    [EricP: reduced compiler complexity and development work]
    "

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Oct 16 10:32:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <[email protected]> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than of
    Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
    not call them spinless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by
    changing it to a data flow dependency, may be applicable in other areas.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.
    Then I won't have to spend time cleaning my shoes afterwards.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Oct 16 23:04:44 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 07:00:58 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. NetSpectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.


    I don't think that was the primary mitigation of Spectre Variant 1
    implemented in browsers.
    Indeed, they made the clock less precise, but that was their secondary
    line of defense, mostly aimed at new SPECTRE variants that have not
    been discovered yet.
    For Spectre Variant 1 they implemented a much more direct defense.
    For example, before mitigation the JS statement val = x[i] was compiled to:
    cmp %RAX, 0(%RDX)  # compare i with x.limit
    jbe oob_handler
    mov 8(%RDX, %RAX, 4), %RCX
    After mitigation it looks like:
    xor %ECX, %ECX
    cmp %RAX, 0(%RDX)  # compare i with x.limit
    jbe oob_handler
    cmovbe %ECX, %EAX  # data dependency zeroes the index, preventing problematic speculation
    mov 8(%RDX, %RAX, 4), %RCX

    Almost identical code could be generated on ARM or POWER or SPARC. On
    MIPS rev6 it could be even shorter. On non-extended RISC-V it would be
    somewhat longer, but browser vendors do not care about RISC-V, extended
    or not.

    The part above was written for the benefit of interested bystanders.
    You already know all that.


    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton

    Maybe I'll look at it some day. Certainly not tonight.
    Maybe never.
    After all, neither you nor I are experts in the design of modern high-perf
    CPUs. So our reasonings about the performance impact of this or that HW
    solution are at best educated hand-waving.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Oct 16 15:17:22 2025
    From Newsgroup: comp.arch

    On 10/16/2025 12:44 AM, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.


    With some fighting as to what exactly it means:
    Small Listing (or smallest viable listing);
    Simple Instructions (Eg: Load/Store);
    Fixed-size instructions;
    ...

    So, for RISC-V:
    First point only really holds in the case of RV64I.
    For RV64G, there is already a lot of unnecessary stuff in there.
    Second Point:
    Fails with the 'A' extension;
    Also parts of F/D.
    Third Point:
    Fails with RV-C.
    Though, people redefine it:
    Still RISC so long as not using an x86-style encoding scheme.

    Well, and there is still the past example of some old marketing for the
    MSP430 trying to pass it off as a RISC, where it had more in common with
    the PDP-11 than with any of the RISCs (and the only reason the listing
    looks tiny is that it ignores the special cases encoded in certain
    combinations of registers and addressing modes).

    Like, you can sweep things like immediate-form instructions when you can
    do "@PC+" and get the same effect.


    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)


    RISC-V tends to fail at this one in some areas...

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    The P extension is also a fail in this area, as they went whole-hog in defining new instructions for nearly every possible combination.



    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.


    IME, SIMD tends to primarily show benefits with 2 and 4 element vectors.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Also, element sizes:
    Most of the dominant use-cases seem to involve 16 and 32 bit elements.
    Most cases that involve 8 bit elements are less suited to actual
    computation at 8 bits (for example, RGB math often works better at 16 bits).


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit spot.
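
    A small C sketch of that trick (my own illustration, not BGB's actual
    code): replicating the 8-bit channel into both bytes of a 16-bit lane is
    a multiply by 257, so 0x00..0xFF maps onto 0x0000..0xFFFF and a plain
    shift-by-8 narrowing round-trips exactly:

    #include <stdint.h>

    static inline uint16_t widen8(uint8_t v)   /* 0xAB -> 0xABAB (v*257)   */
    {
        return (uint16_t)((v << 8) | v);
    }

    static inline uint8_t narrow16(uint16_t v) /* high byte: exact inverse */
    {
        return (uint8_t)(v >> 8);
    }

    /* Example: 50/50 blend of two channels in the widened domain. */
    static inline uint8_t blend50(uint8_t a, uint8_t b)
    {
        return narrow16((uint16_t)((widen8(a) + widen8(b)) >> 1));
    }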

    For various tasks, it might have been better to have gone with an
    unpack/repack scheme like:
    Pad2.Value8.Frac6
    Pad4.Value8.Frac4
    Where Pad can deal with values outside unit range, and Frac with values between the two LDR points. Then the RGB narrowing conversion operations
    could have had the option for round-and-saturate.

    Though, a more tacky option is to use the existing unpack operation and
    then invert the low-order bits to add a little bit of padding space for underflow/overflow.

    Another option being to use "Packed Shift" instructions to get a format
    with pad bits.


    No saturating ops in my case, as saturating ops didn't seem worth it
    (and having Wrap/SSat/USat/... is a big part of the combinatorial
    explosion seen in the P extension).



    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.


    Checking, if I take XG3, and exclude SIMD, 128-bit integer instructions,
    stuff for 96-bit addressing, etc, the listing drops to around 208 instructions.

    This does still include things like instructions with niche addressing
    modes (such as "(GP,Disp16)"), etc.

    If stripped back to "core instructions" (excluding rarely-used
    instructions, such as ROT*/etc, and some of these alternate-mode
    instructions, etc), could be dropped back a little further.

    There are some instructions in the listing that would have been merged
    in RISC-V, like FPU instructions which differ only in rounding mode (the
    RNE and DYN instructions exist as separate instructions in this case, ...).


    It is a little over 400 if the SIMD and ALUX stuff and similar is added
    back in (excluding things like placeholder spots, or instructions which
    were copied from XG2 but are either N/A or redundant, ...).

    There is a fair chunk of instructions which mostly exist as SIMD format converters and similar.


    So, seems roughly:
    ~ 50%: Base instructions
    ~ 20%: ALUX and 96-bit addressing.
    ~ 30%: SIMD stuff

    Internally to the CPU core, there are roughly 44 core operations ATM,
    though many multiplex groups of related operations as sub-operations.

    So, things like ALU/CONV/etc don't represent a single instruction.
    But, JMP/JSR/BRA/BSR are singular operations (and BRA/BSR both map to
    JAL on the RV side, differing as to whether Rd is X0 or X1; similarly
    with both JMP and JSR mapping to JALR in a similar way).

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).


    Other option being to trap and (potentially) emulate, if Rd is not X0 or
    X1 (or just ignore it). Also, very possible, is demoting basically the
    entire RV 'A' extension to "trap and emulate".

    So, in HW:
    RV64I : Fully
    M : Mostly
    A : Trap/Emulate
    F/D : Partial (many cases are traps)
    Zicsr : Partial (trap in general case)
    Zifencei: Trap
    ...


    where, say, ALU gets a 6-bit control value:
    (3:0): Which basic operation to perform;
    (5:4): In one of several ways:
    00: 32-bit, sign-ext result (eg: ADDW in RV terms)
    01: 32-bit, zero-ext result (eg: ADDWU in RV terms)
    10: 64-bit (ADD)
    11: 2x 32-bit for some ops (e.g. PADD.L) or 4x 16-bit for others
    (e.g. PADD.W).
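
    A minimal C sketch of decoding that 6-bit control value (field layout
    as described above; the names are placeholders, not the actual core's):

    #include <stdint.h>

    enum alu_width {
        W32_SX = 0,  /* 00: 32-bit, sign-extended result (ADDW-like)  */
        W32_ZX = 1,  /* 01: 32-bit, zero-extended result (ADDWU-like) */
        W64    = 2,  /* 10: 64-bit (ADD)                              */
        WPACK  = 3   /* 11: packed 2x32 or 4x16, depending on the op  */
    };

    static void decode_alu_ctl(uint8_t ctl, int *op, enum alu_width *w)
    {
        *op = ctl & 0x0F;                          /* bits 3:0: basic op   */
        *w  = (enum alu_width)((ctl >> 4) & 0x3);  /* bits 5:4: width/mode */
    }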

    There is CONV/CONV2/CONV3:
    CONV: Simple 2R converter ops which may have 1-cycle latency
    (later demoted to 2-cycle, with MOV being relocated elsewhere).
    CONV2: More complex 2R converter ops, 2 cycle latency.
    CONV3: Same as CONV2, but because CONV2 ran out of space.


    Still no real mechanism to deal with the potential proliferation of
    ".UW" instructions in RISC-V; for now I have been ignoring this.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Oct 16 16:26:27 2025
    From Newsgroup: comp.arch

    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware.  With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.  With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.  I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind.  It is akin to letting the processor hardware handle multiple instructions in parallel in superscalar cpus, rather than Itanium EPIC coding.


    But, there is a problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
    Vector Masking;
    Resource and energy costs of using wider vectors;
    ...

    Then, for 'V':
    In the basic case, it effectively doubles the size of the register file
    vs 'G';
    ...


    Then we have x86 land:
    SSE: Did well;
    AVX256: Rocky start, negligible benefit from the YMM registers;
    Using AVX encodings for 128-bit vectors being arguably better.
    AVX512: Sorta exists, but:
    Very often not supported;
    Trying to use it (on supported hardware) often makes stuff slower.

    If even Intel can't make their crap work well, I am skeptical.

    While arguably GPUs were very wide, it is different:
    They were often doing very specialized tasks (such as 3D rendering);
    And, often with a SIMT model rather than "very large SIMD";
    Things like CUDA (and RTX) actually push things narrower;
    Larger numbers of narrower cores,
    rather than smaller number of wider cores.
    ...


    The one area that doesn't seem to run into a diminishing returns wall
    seems to be to map "embarrassingly parallel" problems to large numbers
    of processor cores, and to try to keep things as loosely coupled as
    possible.

    This works mostly until the CPU runs out of memory bandwidth or similar.



    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.


    Agreed, this is more the stance I take.

    Instructions should be simple for the hardware and should allow for low latency, rather than trying to make the instruction listing small.



    Though, that said, I still did end up in my case making most
    instructions have a 2 or 3 cycle latency.

    So, generally, MOV-RR and MOV-IR end up as basically the only single-cycle instructions. A case could almost be made for making *all* instructions 2 or 3 cycles and then eliminating forwarding from EX1
    entirely (or maybe adding an EX4 stage).

    Say:
    PF IF ID RF E1 E2 E3 WB
    FW from E2 and E3
    RAW hazard between RF and E1 always stalls.
    Or:
    PF IF ID RF E1 E2 E3 E4 WB
    FW from E2, E3, and E4.

    With an E4 stage, one could maybe allow for pipelined low-precision FMAC
    or similar.


    Though, I see it more as the ISA not actively hindering achieving >= 1
    IPC throughput, rather than instructions having 1 cycle latency.

    But, can note that having 2 cycle latency does hinder the efficiency of
    some common patterns in RISC-V, where tight register RAW dependencies
    run rampant.

    So, say, you ideally want 5-8 instructions between each instruction and
    the next instruction that uses the result. This typically does not
    happen in most code, and particularly not if one needs instruction
    chains for semi-common idioms (say, where the optimal instruction
    scheduling would far exceed the length of a typical loop body).

    For better or worse, this does tend to result in a lot of
    performance-sensitive code being written to use fairly heavy-handed loop
    unrolling though (as in the sketch below).
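
    A small C sketch of the kind of unrolling meant here: several
    independent accumulators, so that no add consumes a result produced on
    the immediately preceding cycle (assumes a 2-3 cycle latency, as above):

    #include <stddef.h>

    float sum4(const float *a, size_t n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i + 0];   /* four independent dependency chains */
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)    /* scalar tail */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }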

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 21:52:22 2025
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:

    MitchAlsup wrote:
    EricP <[email protected]> posted: ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than of
    Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen" which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.

    This adds unnecessary execution latency to the architectural path.
    Without the check you have <say> 3-cycle unchecked LD
    With the check you have 4-cycle checked LD

    Now get some multi-pointer chasing per iteration algorithm in a loop and
    all of a sudden the execution window is no longer big enough to run it at
    full speed.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.

    Just making sure you remain aware of the cow-pies littering the field...

    Then I won't have to spend time cleaning my shoes afterwards.

    I am more worried about the blood on the shoes than the cow-pie.
    {{shooting oneself in the foot}}
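
    As an aside, a rough C-level analogue of the "turn the bounds check into
    a data dependency" idea above (a sketch only; this mask trick is a
    software variant, not the CHKcc instruction itself, and a compiler may
    still turn the comparison back into a branch unless written in asm):

    #include <stddef.h>

    /* Returns idx if idx < limit, else 0, without a conditional branch;
       the load address below is then data-dependent on the comparison. */
    static inline size_t clamp_index(size_t idx, size_t limit)
    {
        size_t mask = (size_t)0 - (size_t)(idx < limit);  /* all-ones or 0 */
        return idx & mask;
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        return base[clamp_index(idx, limit)];
    }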
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 21:59:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <[email protected]> posted:

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.

    Terje
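
    A tiny C sketch of the vector-normalization use mentioned above (here
    rsqrt() is just 1/sqrt(); a real implementation would use a HW estimate
    refined by a Newton-Raphson step or two):

    #include <math.h>

    static inline double rsqrt(double x) { return 1.0 / sqrt(x); }

    /* Normalize a 3-vector with one rsqrt and three multiplies,
       instead of one sqrt plus three divides. */
    static void normalize3(double v[3])
    {
        double r = rsqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        v[0] *= r; v[1] *= r; v[2] *= r;
    }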

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Oct 16 22:19:21 2025
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up the instruction opcode space with a combinatorial explosion. (Or sequence of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.

    Among SIMD's ISA problems is additional state at context switch time
    on top of FP's added state at context switch time; but with all the
    fast memory move subroutines being SIMD-based--the service routines
    need access to SIMD that they don't normally need for FP {and the
    SIMD register file is larger, too}

    With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.

    Vector LD and ST instructions are not conceptually different than
    LDM and STM--1 instruction accesses multiple memory locations.

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve
    many memory aliasing issues to use the vector ISA.

    Software writes vector loops--yet the HW vectorizes instructions.

    {{I might note My 66000 vectorizes loops not instructions to avoid
    this problem; For example::

    for( i = 0; i < max; i++ )
    {
        temp = a[i];
        a[i] = a[max-i];
        a[max-i] = temp;
    }

    is vectorizable in My 66000--those loops where the memory references
    do not overlap can run "as fast as the width of the data path allow"
    while those with memory reference collisions run no worse than scalar
    code. For a large value of max the profile would look like::

    FFFFFFFFFFFFFFFFFsssFFFFFFFFFFFFFFFFF

    F representing fast (say 4-wide or 8-wide)
    s representing slow (say 1-wide)

    The same binary runs as fast as memory references (and data-flow
    dependencies and data-path width) allow.
    }}

    I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in parallel in superscalar cpus, rather than Itanium EPIC coding.


    Also there might be some pipeline benefits in having longer vector
    operands ... I’ll bow to your opinion on that.

    CRAY-like vector computers built memory systems that could handle the load
    of the vector calculations. CRAY-1 could perform a new memory access every clock, CRAY-[XY]MP could handle 2 LDs and 1 ST per clock continuously.

    If those CPUs of today were really going to fully utilize the vector
    data-path, they are going to have to have a lot better memory system
    than they are building presently (1 new cache miss per cycle).

    The power of the vector computers was almost entirely in the memory system
    not in the data path (which is surprisingly easy to build, and surprisingly difficult to keep fed).

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the “R” stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    On vacation over the summer, I coined a new phrase to denote what I
    hope My 66000 will end up being::

    CARD Computer Architecture Rightly Done.

    Note: It does not stop at ISA--as ISA is less than 1/3rd of what a
    computer architecture is and means.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Thu Oct 16 23:13:58 2025
    From Newsgroup: comp.arch



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning the
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    Why take a chance? There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and unfinished, at least someone is working on it.


    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:48:27 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:51:18 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive the (otherwise universal) transition
    to RISC was kept afloat through high revenues and high margins, which
    allowed the company to spend the much higher sums needed to add all the
    extra millions of transistors necessary to keep performance competitive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 06:53:16 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 07:03:16 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common code
    to execute the actual loop bodies.

    Most use-cases for longer vectors tend to matrix-like rather than vector-like. Or, what cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful results
    out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit
    spot.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet
    printers had 5 or 6 different colour inks, to try to fill out more of the
    CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision
    might not be enough, so they allow for double precision as well.

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly pretty much never used in practice (so not really worth the logic cost).

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to
    address in CTR and leave return address in LR).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Oct 17 13:54:50 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive
    There are two of them..
    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep performance competitive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Fri Oct 17 13:59:33 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David
    Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this
    *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common
    was the larger register sets.
    Larger register sets were common, but not universal.
    Load/store architecture (with allowance for exceptions for
    synchronization primitives that are not expected to be as fast as
    normal instructions) appears to be universal.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Fri Oct 17 14:31:46 2025
    From Newsgroup: comp.arch

    On 16/10/2025 23:26, BGB wrote:
    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single
    primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid
    filling up
    the instruction opcode space with a combinatorial explosion. (Or
    sequence
    of combinatorial explosions, when you look at the wave after wave of
    SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on
    different hardware.  With SIMD, you need different code if your
    processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
    different instructions using different SIMD registers.  With the
    vector style instructions in RISC-V, the actual SIMD registers and
    implementation are not exposed to the ISA and you have the same code
    no matter how wide the actual execution units are.  I have no
    experience with this (or much experience with SIMD), but that seems
    like a big win to my mind.  It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar cpus,
    rather than Itanium EPIC coding.


    But, there is problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
      Vector Masking;
      Resource and energy costs of using wider vectors;
      ...


    I appreciate that. Often you will either be wanting the operations to
    be done on a small number of elements, or you will want to do it for a
    large block of N elements which may be determined at run-time. There
    are some algorithms, such as in cryptography, where you have sizeable but fixed-size blocks.

    When you are dealing with small, fixed-size vectors, x86-style SIMD can
    be fine - you can treat your four-element vectors as single objects to
    be loaded, passed around, and operated on. But when you have a large
    run-time count N, it gets a lot more inefficient. First you have to
    decide what SIMD extensions you are going to require from the target,
    and thus how wide your SIMD instructions will be - say, M elements.
    Then you need to loop N / M times, doing M elements at a time. Then you
    need to handle the remaining N % M elements - possibly using smaller
    SIMD operations, possibly doing them with serial instructions (noting
    that there might be different details in the implementation of SIMD and
    serial instructions, especially for floating point).
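
    A bare-bones sketch of that structure (plain C stands in for the
    intrinsics; M is an assumed SIMD width, and the inner fixed-count loop
    is what the M-wide instructions would replace):

    #include <stddef.h>

    #define M 8   /* assumed SIMD width, in elements */

    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + M <= n; i += M)           /* N / M full chunks */
            for (size_t j = 0; j < M; j++)
                dst[i + j] = a[i + j] + b[i + j];
        for (; i < n; i++)                   /* N % M remainder, done serially */
            dst[i] = a[i] + b[i];
    }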

    The resulting code is big, ugly, tuned to specific targets (it will be
    slower than optimal if run on a target with wider SIMD, and won't run at
    all on a target with narrower SIMD), and have huge overhead if it
    happens to be run with a small N. Oh, and it might not work - or work
    less efficiently - if the data alignments are not ideal.

    Vector processing avoids pretty much all of those disadvantages.

    Just try writing a loop function in godbolt.org, and compile it with x86
    clang or gcc -O3 -march=rocketlake, and compare the results to compiling
    it for risc-v with -march=rv64gv.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Fri Oct 17 14:38:08 2025
    From Newsgroup: comp.arch

    On 17/10/2025 08:48, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many
    memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    "restrict" can significantly improve non-vectored code too, as well as
    more "ad-hoc" vectoring of code where the compiler uses general-purpose registers, but interlaces loads, stores and operations to improve
    pipelining. But it is certainly a very useful qualifier for vector code.
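
    For example (a minimal sketch): with "restrict" the compiler may assume
    the three pointers never overlap, so it is free to vectorize and to
    interleave the loads and stores:

    #include <stddef.h>

    void vadd(float *restrict dst,
              const float *restrict a,
              const float *restrict b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }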


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 13:00:48 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
    x86 / x86-64: Alive and well in PCs.
    6502: Now dead (no more 6502's being made)
    65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
    M68K: Mostly Dead
    NXP ColdFire: Still lives (Simplified M68K).
    MSP430: Still Lives (I classify it as a CISC).
    IBM S/360: Dead on real HW
    Lives on in emulation.



    In looking around, I noted that apparently my VUGID/ACLID idea isn't
    entirely novel. Apparently something similar existed in S/360 and IA-64
    under the name of "Protection Keys".

    Then again, the origin of this idea in my case was basically "borrowed"
    from the "Tron 2.0" game, which presented a similar idea in the game (to justify why doors could be locked, a normal game mechanic), and I was
    left thinking "Why not?..."

    Well, apparently real HW did do this, just not x86 or ARM or similar...


    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep
    performance competitive.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 11:49:03 2025
    From Newsgroup: comp.arch

    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
      x86 / x86-64: Alive and well in PCs.
      6502: Now dead (no more 6502's being made)
        65C816: Still holding on (niche), backwards compatible with 6502.
      Z80: Dead
      M68K: Mostly Dead
        NXP ColdFire: Still lives (Simplified M68K).
      MSP430: Still Lives (I classify it as a CISC).
      IBM S/360: Dead on real HW
        Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the S/360
    are alive in real hardware. While I expect there haven't been any "new
    name" customers in a long time, the fact that IBM still introduces new
    chips every few years indicates that there is still a market for this architecture, presumably by existing customer's existing workload
    growth, and perhaps new applications related to existing ones.

    Some of the original BUNCH architectures do live on in emulation
    (Burroughs, Univac, Honeywell). I believe the other two, CDC and NCR,
    are dead.

    I expect that all of the minicomputer age architectures are dead.

    There also were lots of microcomputer "chip" architectures that are dead (National Semi, ATT, Fairchild, etc.), but I don't necessarily attribute
    that to being overtaken by RISC architectures.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 14:43:50 2025
    From Newsgroup: comp.arch

    On 10/17/2025 1:49 PM, Stephen Fuld wrote:
    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
       x86 / x86-64: Alive and well in PCs.
       6502: Now dead (no more 6502's being made)
      65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
       M68K: Mostly Dead
         NXP ColdFire: Still lives (Simplified M68K).
       MSP430: Still Lives (I classify it as a CISC).
       IBM S/360: Dead on real HW
         Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the S/360
    are alive in real hardware.  While I expect there haven't been any "new name" customers in a long time, the fact that IBM still introduces new
    chips every few years indicates that there is still a market for this architecture, presumably by existing customer's existing workload
    growth, and perhaps new applications related to existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.


    Some of the original BUNCH architectures do live on in emulation (Burroughts, Univac, Honeywell).  I believe the other two, CDC and NCR
    are dead.

    I expect that all of the minicomputer age architectures are dead.

    There also were lots of microcomputer "chip" architectures that are dead (National Semi, ATT, Fairchild, etc.), but I don't necessarily attribute that to being overtaken by RISC architectures.



    OK.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 13:10:25 2025
    From Newsgroup: comp.arch

    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:
    On 10/17/2025 11:00 AM, BGB wrote:
    On 10/17/2025 5:54 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..


    AFAIK:
       x86 / x86-64: Alive and well in PCs.
       6502: Now dead (no more 6502's being made)
      65C816: Still holding on (niche), backwards compatible with 6502.
    Z80: Dead
       M68K: Mostly Dead
         NXP ColdFire: Still lives (Simplified M68K).
       MSP430: Still Lives (I classify it as a CISC).
       IBM S/360: Dead on real HW
         Lives on in emulation.

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware.  While I expect there haven't been
    any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 15:32:39 2025
    From Newsgroup: comp.arch

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common code
    to execute the actual loop bodies.


    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    Most use-cases for longer vectors tend to matrix-like rather than
    vector-like. Or, what cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.


    Maybe.
    Just in my own experience, it seems to fizzle out pretty quickly.

    Typically it is a combination of diminishing returns and costs that reach
    for the sky.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.


    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    In my case, though, to have any real advantage over the existing SIMD, I
    would effectively need a wider memory interface (say, one capable of
    doing 2 loads and 1 store per cycle). If limited to 1 memory access per
    cycle, it would still be effectively limited to ~ 1 element/cycle on
    average (or maybe 2 elements/cycle with Binary16; since I could
    effectively load/store 128 bits at a time, assuming a SIMD-op
    co-executing with one of the memory ops).


    Ironically, this is one of the merits of FP8 and block-encoding for
    weights in NNs: Can effectively batch the memory accesses (by loading
    larger units) so is slightly less hindered by a 1-access-per-cycle
    limitation.

    Though, even if I did have a wider pipe to memory, there would still be the problem of memory bandwidth (to L2 and to external RAM). And, it would
    likely need semi-intelligent streaming or prefetching to get more
    effective use out of what DRAM bandwidth exist (RAM access via the L2
    cache being somewhat slower than the raw bandwidth to the RAM chip).

    Though, one trick was used in the L1 cache that helps:
    If the accessed cache line misses, and the following cache line also
    misses, then handle the misses for both cache lines at the same time (this assumes that the next line address is likely to be accessed in the near future).

    In premise, the L2 cache could use similar logic; though this would
    require more logic, as normally the L2 cache deals with each line access independently (vs the L1 cache, which has to deal with the possibility
    of line-crossing accesses as a normal part of its operation).


    Well, the L1 I$ also always requires both lines to hit (whether or
    not the current fetch crosses a line boundary), vs the L1D$, which only
    needs to stall if the current access misses. A possible optimization
    could be to allow for asynchronous prefetch, but this could lead to more complex scenarios, such as needing to stall for a miss but then wait for
    a preceding in-flight RAM access to finish before the next request could
    be issued. So in this case the L1D$ doesn't allow for asynchronous fetch,
    even if it could be faster. It would otherwise need more complex logic to
    deal with asynchronous memory prefetching in a way that doesn't put
    stability at risk.


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit
    spot.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet
    printers had 5 or 6 different colour inks, to try to fill out more of the
    CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision
    might not be enough, so they allow for double precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    *: The better solution to possible banding issues being not so much to
    use more color depth, but rather to dither. Though, AFAIK a lot of LCD
    panels have built-in dithering, so rather than seeing either true RGB24,
    or an more obviously banded RGB555 or RGB666 approximation, the monitor
    will show a representation with a Bayer dither or similar applied (which
    is mostly not noticeable unless one looks very closely).


    For HDR:
    3x E4.F4 is pretty comparable to RGB555 in terms of quality;
    2x Binary16 is plenty.

    Binary32 or Binary64 seems like serious overkill for HDR image storage.


    Well, and then there is the R11_G11_B10 format:
    R=E5.M6, G=E5.M6, B=E5.M5

    Which is possibly a better option:
    Will match/exceed display quality while still allowing HDR, and more
    compact storage than 3x Binary16.

    Or, RGB9_E5, ...

    One other traditional HDR format is RGB8_E8, but this has its own wonk.


    Though, within existing monitors or computers, little can be done to
    improve over RGB.


    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a
    slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for inkjet printers).

    So, it is like:
    Real life, computers, and inkjet printers, all exist in similar but
    different worlds in terms of color display.

    Well, also LED bulbs, particularly cheap ones or multi-color ones,
    tend to make everything look computer-like (bleh; I actually prefer
    the look of CFLs over this; or halogen bulbs which can at least make a
    proper white light...).

    Had noted when messing around with LEDs, that one generally needs 4 LEDS
    to get something that looks like natural white light.

    IME, I could get this effect with two different schemes:
    R: 675nm
    G: 525nm
    H: 480nm
    B: 440nm
    And, with more readily available LEDs:
    R: 675nm
    G: 525nm
    H: 465nm
    B: 400nm
    Where, either 480nm+440nm or 465nm+400nm can allow for something
    resembling pure white. Can sort of approximate real colors by setting
    the H value to a blend of G and B.

    Comparably, 465nm and 400nm LEDs are easier to find, but proper 440nm
    and 480nm LEDs are a pain to find (where, 480nm is sort of a unique
    color that doesn't really exist on computer displays).

    Can note that 400nm looks different on phone vs real life:
    Phone sees it as a pinkish color;
    Real life, it looks like a very strong blue (similar to 440nm).
    The 465nm LED is a little closer to the sky in real life, but not a good
    match for "blue" on computer displays (usually closer to 440nm).

    But, neither really match cyan or azure on computers, which for me is a different color (more of a separate mixing of green and blue).

    But, partly it is a case of, "meh, it is what it is". Oddly no one else
    really seems to notice the issue, so, ...




    For my uses, for storing HDR within JPEG or UPIC (1) (a custom, vaguely JPEG-like format), I had generally used 3x E4.M4 or similar (which mostly
    works fine, albeit it looks a little funky if one looks at the HDR image as linear RGB).

    Where, UPIC is a format sorta like T.81 JPEG, with some changes:
    Huffman -> STF+AdRice
    DCT -> Block-Haar
    Still uses an 8x8 transform organized into 16x16 blocks.
    Different VLC scheme (Z3.V5)
    Uses RCT vs YCbCr
    Uses a TLV packaging scheme.
    Mostly TWOCC's, with lengths stored inverted.
    The scheme allowed a nice way to allow variable tag/length sizes.
    Placed more limits on the allowed subsampling modes:
    4:2:0, 4:4:4 (RGB)
    4:2:0:4, 4:4:4:4 (RGBA)
    4:0:0, 4:0:0:4 (Monochrome, Monochrome+Alpha)

    Though, there are a few close-calls in terms of optimal choice:
    STF+AdRice vs Huffman with a 13-bit length limit.
    Smaller length limit on Huffman makes it faster.
    Below 12 or 13 severely reduces its effectiveness though.
    However, STF+AdRice needs very little context and has fast setup.
    Also less code vs 13-bit Huffman.
    But, in a strict sense, both speed and compression are worse.
    Block-Haar vs WHT
    Both Block-Haar and WHT are exactly reversible (unlike DCT).
    DCT can be made reversible, but this form is very slow.
    Block-Haar does better with synthetic images, WHT with photos.
    Block-Haar is slightly faster.
    RCT has a close-call with YCoCg.
    RCT was both slightly faster and compressed better in my testing.
    Both are reversible, unlike YCbCr.


    The use of fully reversible transforms does allow also using the format
    in place of PNG. In a PNG-like role, it tends to both often compress
    slightly better, as well as being faster to decode and uses less working
    RAM. Decoding a PNG needs a significant chunk of intermediate memory,
    which can be side-stepped in both UPIC and in an optimized JPEG decoder; though a "generic" JPEG decoder would need more working memory
    (typically needing buffers for decoding the luma and chroma planes for
    the whole image, rather than working "one block at a time").

    Where, for context:
    STF+AdRice:
    Swap-Towards-Front:
    Starts with an initial permutation of all symbols in order;
    Encoding a symbol swaps it towards the front;
    Symbols are encoded as their index in this table;
    Tends to converge towards an optimal ranking.
    AdRice: Adaptive Golomb-Rice Coding
    Can be used in a similar way to Huffman or Adaptive Huffman.
    But, significantly faster than Adaptive Huffman.
    Block Haar: Uses a 2D transform built from 1D transforms, like DCT
    I0..I7 -> O0..O7 basically
    J0=(I0+I1)/2 J1=(I2+I3)/2 J2=(I4+I5)/2 J3=(I6+I7)/2
    J4=I0-I1 J5=I2-I3 J6=I4-I5 J7=I6-I7
    K0=(J0+J1)/2 K1=(J2+J3)/2 K2=J0-J1 K3=J2-J3
    L0=(K0+K1)/2 L1=K0-K1
    O0..O7 = {L0,L1,K2,K3,J4,J5,J6,J7}
    Can use the same ZigZag ordering and Quantization approach as JPEG.
    Albeit with different math for building the quantization matrix.
    Filling the matrix with all 1's allowing for lossless.
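
    A C sketch of the 1-D step listed above (my reconstruction of the
    J/K/L equations; I use floor averages via >>1 so that each
    (average, difference) pair is exactly invertible, e.g.
    a = s + ((d + 1) >> 1), b = a - d):

    static void haar1d_fwd8(const int I[8], int O[8])
    {
        int J[8], K[4], L[2];
        for (int k = 0; k < 4; k++) {
            J[k]     = (I[2*k] + I[2*k + 1]) >> 1;  /* pair averages    */
            J[4 + k] =  I[2*k] - I[2*k + 1];        /* pair differences */
        }
        for (int k = 0; k < 2; k++) {
            K[k]     = (J[2*k] + J[2*k + 1]) >> 1;
            K[2 + k] =  J[2*k] - J[2*k + 1];
        }
        L[0] = (K[0] + K[1]) >> 1;
        L[1] =  K[0] - K[1];
        O[0] = L[0]; O[1] = L[1];
        O[2] = K[2]; O[3] = K[3];
        O[4] = J[4]; O[5] = J[5]; O[6] = J[6]; O[7] = J[7];
    }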

    Different choices might be made if the goal was to have maximum
    compression, but I was biased more towards wanting to keep the decoder
    size modest and reasonably fast.

    Though, it is possible to gain compression (at the cost of speed) by
    running the image bitstream bytes through an LZMA style range coder
    (though, a harder problem is making a range-coder fast).


    The change in VLC scheme:
    3 bits encode the run of zeroes (so, can't skip as many zeroes);
    Uses 5 bits for the coefficient value;
    Coefficient uses a similar encoding scheme to Distance values in Deflate; Signed values are zigzag folded: "V1=(V0<<1)^(V0>>31);"

    IME, this is both less awkward and also compresses slightly better than
    the scheme originally used by JPEG. Though, as with JPEG, an 00 symbol
    can encode an early EOB (all remaining coefficients zero).
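
    The fold quoted above and its inverse, for reference (it maps
    0, -1, 1, -2, 2, ... onto 0, 1, 2, 3, 4, ...):

    #include <stdint.h>

    static inline uint32_t zz_fold(int32_t v)
    {
        return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31);
    }

    static inline int32_t zz_unfold(uint32_t z)
    {
        return (int32_t)(z >> 1) ^ -(int32_t)(z & 1);
    }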


    Can note that, one scheme I had used elsewhere for Huffman coding was:
    Symbols are limited to 13 bits;
    A shorter limit makes the lookup table smaller;
    So, less setup time, and less L1 misses in decoding.
    Huffman tables are encoded as a series of 4 bit lengths;
    0..D: Symbol Length
    E,x: RLE run of preceding length.
    F,x: RLE run of zeroes.
    Both simpler and cheaper to decode than the scheme used by Deflate.
    While typically being similarly compact.
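
    A sketch of decoding that 4-bit length stream (the exact run-length
    biases for the E and F codes aren't given above; the "+3" and "+4"
    below are assumptions purely for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* nibbles[] holds one 4-bit code per byte; returns the number of
       symbol lengths written to lens[]. */
    static size_t decode_lengths(const uint8_t *nibbles, size_t n_nib,
                                 uint8_t *lens, size_t max_syms)
    {
        size_t out = 0;
        uint8_t prev = 0;
        for (size_t i = 0; i < n_nib && out < max_syms; i++) {
            uint8_t c = nibbles[i] & 0xF;
            if (c <= 0xD) {                     /* literal length 0..13 */
                lens[out++] = prev = c;
            } else if (i + 1 < n_nib) {
                uint8_t x = nibbles[++i] & 0xF;
                uint8_t run = (c == 0xE) ? x + 3 : x + 4;  /* ASSUMED biases */
                uint8_t val = (c == 0xE) ? prev : 0;
                while (run-- && out < max_syms)
                    lens[out++] = val;
            }
        }
        return out;
    }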


    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).


    Jumping to an arbitrary address can be useful.
    Using whatever random register as a link register, not as much.

    So, nearly always, it is one of:
    X0: Plain branch;
    X1: Branch-with-link.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:54:23 2025
    From Newsgroup: comp.arch


    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning the
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FF1, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.

    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative approach.

    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:55:51 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive the (otherwise universal) transition to RISC was kept afloat through high revenues and high margins, which allowed the company to spend the much higher sums needed to add all the extra millions of transistors necessary to keep performance competitive.

    Never underestimate the work designers can do when given cubic dollars
    of budget under which to work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 20:59:10 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:

    If even Intel can't make their crap work well, I am skeptical.

    The only CISC architecture to survive

    There are two of them..

    Only one selling more than 1M per month.

    the (otherwise universal)
    transition to RISC was kept afloat through high revenues and high
    margins, which allowed the company to spend the much higher sums
    needed to add all the extra millions of transistors necessary to keep performance competitive.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 15:59:51 2025
    From Newsgroup: comp.arch

    On 10/17/2025 1:48 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many
    memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    Ironically, this is also partly why I suspect a C-like language could
    benefit from having a "T[]" type that was distinct from "T*", even if
    they were the same representation internally (a bare memory pointer):
    "T[]" could be safely assumed to never alias, except in cases where one
    could have two references to the same array (in which case, they will only
    alias at the same index; and this likely only matters if the inputs and
    outputs are potentially the same array).

    Though, in C, "int arr[]" as an argument is regarded as equivalent to
    "int *arr", so no useful conclusions could be drawn in this case
    (ideally one would need a language where implicit conversion from
    "T*"->"T[]" is an error, and implicit conversion from "T[]"->"T*" is a warning).



    But, alas...

    At least in theory "restrict" works, when people use it.

    Though, "assume TBAA as the universal default" makes some cases faster,
    while screwing over some other use cases; and still doesn't fully
    resolve the type-alias issue as one still can't assume non-alias in
    cases when types are the same.
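
    To make the "restrict" point concrete, here is a minimal sketch (an
    illustration added for this write-up, not code from the post; the
    function name is made up): with "restrict" on the pointer arguments,
    the compiler may assume the output does not overlap the inputs, which
    is exactly the guarantee a vectorizer wants.

        #include <stddef.h>

        /* Sketch: "restrict" promises dst does not overlap a or b, so the
           loop can be vectorized without runtime overlap checks. */
        void vec_add(size_t n, float *restrict dst,
                     const float *restrict a, const float *restrict b)
        {
            for (size_t i = 0; i < n; i++)
                dst[i] = a[i] + b[i];
        }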

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 16:15:33 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:59 AM, Michael S wrote:
    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David
    Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results this
    *design approach* should achieve.

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common
    was the larger register sets.

    Larger register sets were common, but not universal.
    Load/store architecture (with allowance for exceptions for
    synchronization primitives that are not expected to be as fast as
    normal instructions) appears to be universal.


    Yeah.

    Otherwise, RISC-V's 'A' extension (which is a serious violation of
    Load/Store) would be a bigger problem.

    But, I have since realized that (because GCC/etc never really uses these
    instructions for general code generation) one can roll back on native
    hardware support and handle them as traps...


    Granted, the preferable option in this case is to have something like
    "MutexLock()" or "EnterCriticalSection()" as a system call (as, on a
    machine without true atomic operations, and with weak memory coherence,
    a system call that is aware of the actual HW behavior is preferable to
    just trying to fake it in trap handlers).

    Well, and on a single-core system, it reduces to a single choice:
      Caller locks the mutex, return to caller;
      Mutex can't be locked right now:
        flag it and schedule another task
        (and hope the mutex unlocks eventually).
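
    As a purely illustrative sketch of that single-core case (the names,
    the cooperative-scheduling assumption, and the yield primitive are
    hypothetical, not taken from anyone's actual kernel):

        #include <stdbool.h>

        /* Single-core, cooperative case: no atomics needed, because only
           one task runs at a time and task switches only happen at yield
           points (an assumption made for this sketch). */
        typedef struct { volatile bool locked; } mutex_t;

        extern void yield_to_scheduler(void);  /* hypothetical kernel hook */

        void mutex_lock(mutex_t *m)
        {
            while (m->locked)          /* can't take it now...             */
                yield_to_scheduler();  /* ...so run another task and retry */
            m->locked = true;          /* no preemption between test & set */
        }

        void mutex_unlock(mutex_t *m)
        {
            m->locked = false;
        }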


    Well, contrast to using spinlocks in userland, which only really makes
    sense if one assumes:
    There are multiple cores;
    Memory accesses are sequentially consistent between threads.

    And, if the implementation needs to trap on a FENCE or similar, it has
    already lost.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:07:18 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was
    the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have a “large” register set?

    (How “large” is “large”? The VAX had 16 registers; was there any RISC architecture with only that few?)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:20:49 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it
    would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively-
    parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)

    Short-vector SIMD was introduced along an entirely separate evolutionary
    path, namely that of bringing DSP-style operations into general-purpose
    CPUs.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the
    two were bound to have underlying features in common.

    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.

    Maybe it’s time to look beyond RGB colours. I remember some “Photo”
    inkjet printers had 5 or 6 different colour inks, to try to fill out
    more of the CIE space. Computer monitors could do the same. Look at the
    OpenEXR image format that these CG folks like to use: that allows for
    more than 3 colour components, and each component can be a float --
    even single-precision might not be enough, so they allow for double
    precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.

    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    Sure, you may think 64-bit floats must be overkill for this purpose; but
    these are artists you’re dealing with. ;)

    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for inkjet printers).

    That is always true; “white” is never truly “white”, which is why those
    who work in colour always talk about a “white point” for defining what is meant by “white”, which is the colour of a perfect “black body” emitter at
    a specific temperature (typically 5500K or above).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:21:31 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:55:51 GMT, MitchAlsup wrote:

    Never underestimate the work designers can do when given cubic dollars
    of budget under which to work.

    “Cubic dollars” ... I like that. ;)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Fri Oct 17 22:24:23 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 14:43:50 -0500, BGB wrote:

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I’ve been told that’s wrong about zArchitecture, but it is true for iArchitecture (current incarnation of AS/400 -- yes, apparently that’s
    still around in a small way, too).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Oct 17 22:52:55 2025
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 10/17/2025 1:48 AM, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?

    Ironically, this is also partly why I suspect a C-like language having a "T[]" type that is distinct from "T*" could be useful, even if
    they were the same representation internally (a bare memory pointer):
    "T[]" could be safely assumed to never alias

    Restrict does nothing to make the following (quoted from a few days
    ago) run fast and get the right answer:
    {
        I might note My 66000 vectorizes loops not instructions to avoid
        this problem; For example::

            for( i = 0; i < max; i++ )
            {
                temp = a[i];
                a[i] = a[max-i];
                a[max-i] = temp;
            }
    }

    On My 66000, when i !~= (max-i) the loop runs at vector speeds;
    when i ~= (max-i) it runs slow to get the right answer
    (where ~= means "within a cache line").

    At least in theory "restrict" works, when people use it

    ... under the specification restrict has.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Fri Oct 17 19:52:05 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware. While I expect there haven't been
    any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?

    The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The stages of the
    z15 core (2019) don't look anything like Power10's (2021).

    https://www.servethehome.com/wp-content/uploads/2020/08/Hot-Chips-32-IBM-Z15-Processor-Pipeline.jpg

    https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-microarchitecture-block-diagram/
    https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-microarchitecture-core-flexibility/




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 18 00:37:43 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    See Univac 1108

    (How “large” is “large”? The VAX had 16 registers; was there any RISC
    architecture with only that few?)

    Clipper.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Oct 18 00:42:27 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively- parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)

    Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.

    CDC 7600 had rather explicit pipelining--a lot more ordered than
    CDC 6600.

    So, in this case, a truer analog of Cray style vectors would not be variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.

    Which is a GOOD thing !!

    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    That is why the arithmetic is done in 16-bits.

    Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)

    Many can see gamut colors you cannot discern--and they care about it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 01:05:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations into
    general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    Actually, there was something Intel tried to do before that, called “NSP”, for “Native Signal Processing”, which was supposed to be the DSP-killer. Microsoft nixed that idea, for some reason.

    MMX came later, and as you may recall, it was a bit of a fudge (sharing registers with the floating-point unit), and not a very successful one at that. Intel couldn’t even decide what “MMX” meant: first it was supposed to be “Multi-Media eXtensions”, then that was changed to “means nothing at
    all”? Why? So it could be trademarked, of course!

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units
    and superscalar execution.

    Which is a GOOD thing !!

    Which? Lockstep SIMD, or more asynchronous multiple function units?

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.

    Secondly, we’re talking about input image formats. Remember that every
    image-processing step is going to introduce some generational loss due
    to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.

    That is why the arithmetic is done in 16-bits.

    Heck no. We’re talking up to 64-bit floats now.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Oct 17 22:22:44 2025
    From Newsgroup: comp.arch

    On 2025-10-17 3:03 a.m., Lawrence D’Oliveiro wrote:

    POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
    subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).

    Something I like about the PowerPC: the link register does not detract
    from the GPRs. A second link register would be handy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Fri Oct 17 22:29:28 2025
    From Newsgroup: comp.arch

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.
    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 22:16:52 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:20 PM, Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:

    On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:

    On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    I know, you could consider that a cheat in some ways. But on the other
    hand, it allows code reuse, by having different (overloaded) function
    entry points each do type-specific setup, then all branch to common
    code to execute the actual loop bodies.

    The SuperH also did this for the FPU:
    Didn't have enough encoding space to fit everything, so they sorta used
    FPU control bits to control which instructions were decoded.

    That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.


    It was mostly needed, for example, when switching between Single and
    Double precision, and sucked...

    Though, can note that encoding for FPU ops looked like:
    1111-nnnn-mmmm-ZZZZ
    So, 16 possible 2R instructions for the FPU.

    There was in effect, not enough encoding space to do the FPU well with
    only 4 bits. So, the FPU encoding was modal.

    Still did pretty well.

    Now, seemingly RISC-V couldn't even manage with 25 bits, so effectively
    burns 7x 25-bit blocks (or nearly 28 bits of entropy) on the FPU.
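
    Purely as an illustration of that 1111-nnnn-mmmm-ZZZZ layout (a
    sketch, not actual SuperH decoder code):

        #include <stdint.h>

        /* Field extraction for a 16-bit opcode of the form
           1111-nnnn-mmmm-ZZZZ; which of the 16 ZZZZ operations is meant
           then further depends on the FPU mode bits, as described above. */
        static void decode_fpu_op(uint16_t insn, int *rn, int *rm, int *zzzz)
        {
            *rn   = (insn >> 8) & 0xF;   /* nnnn: one register field */
            *rm   = (insn >> 4) & 0xF;   /* mmmm: the other register */
            *zzzz =  insn       & 0xF;   /* ZZZZ: 16 possible 2R ops */
        }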


    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.


    Yeah, but this works assuming that your vector ops are primarily mapped
    to long-running loops.

    In a lot of cases, you don't have this, and a large vector won't be usable.

    Consider, you want to write a function to fragment larger primitives
    into smaller primitives to minimize affine warping (where the number of
    input and output primitives will differ and you don't know in advance
    which primitives will fragment, etc). Likely, Cray-style vectors won't
    really help you there (but short-vector SIMD will help).


    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often handled sufficiently with two vectors.

    Back in the days of Seymour Cray, his machines were getting useful
    results out of vector lengths up to 64 elements.

    Perhaps that was more a substitute for parallel processing.

    Maybe. Just in my own experience, it seems to fizzle out pretty quickly.

    Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively- parallel supers, or RISC machines etc. Maybe the parallelism was thought
    to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)


    I think they were also mostly intended for CFD and FEM simulations and similar, or stuff that is very regular (running the same math over a
    whole lot of elements).


    Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose
    CPUs.


    Could be.

    Hadn't really looked that much into where SIMD came from originally.

    Some stuff I had read implied that vector processing came first, but
    then due to the limits of vector processing, supercomputers went over to
    SIMD; and then Intel added MMX presumably as an imitation of these supercomputers, and it went from there.


    It may not count for Cray though, since IIRC their vectors were encoded
    as memory-addresses and they were effectively using pipelining tricks
    for the vectors.

    Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.


    OK.


    So, in this case, a truer analog of Cray style vectors would not be
    variable width SIMD that can fake large vectors, but rather a mechanism
    to stream the vector through a SIMD unit.

    But short-vector SIMD can only deal with operands in lockstep. If you
    loosen this restriction, then you are back to multiple function units and superscalar execution.


    Possibly.

    As can be noted, it makes sense to allow some amount of superscalar over
    the SIMD operations, but this gets limited by whatever is the most
    limited resource.

    In my project, this limit is mostly memory access.


    I did some more benchmarks, and also noted that in my old laptop, it is
    also mostly bound by memory access:
    It can't do vector multiply-accumulate faster than it can read the
    floating point data from memory and write back the results;
    And, the smallest floating-point format it has is Binary32.



    It is likely that to push either vector processing or SIMD to its full performance, one would need a massive amount of memory bandwidth.

    Or, say, on a desktop PC, getting one 128-bit SIMD vector per clock
    at 3.7GHz would need roughly 90GB/sec of memory bandwidth, which is
    not likely to happen anytime soon...

    One can think the PC's CPU is flying when it does memcpy at 3.6 GB/sec
    or so, nowhere near enough.


    But, one thing that does help with relative performance in the face of a bandwidth limit (say, for NNs) is vectors with 8-bit elements and ~ 4-bit/element weights, and the ability to pipeline a lot of secondary
    ops (such as vector conversions) in parallel with other instructions.

    So, for example, even if you can't do a memory load or store at the same
    time as a SIMD op, you can still do SIMD vector conversions in parallel
    with SIMD ops or with memory accesses.



    Maybe it’s time to look beyond RGB colours. I remember some “Photo” >>> inkjet printers had 5 or 6 different colour inks, to try to fill out
    more of the CIE space. Computer monitors could do the same. Look at the
    OpenEXR image format that these CG folks like to use: that allows for
    more than 3 colour components, and each component can be a float --
    even single-precision might not be enough, so they allow for double
    precision as well.


    IME:
    The visible difference between RGB555 and RGB24 is small;
    The difference between RGB24 and RGB30 is mostly imperceptible;
    Though, most modern LCD/LED monitors actually only give around 5 or 6
    bits per color channel (unlike the true analog on VGA CRTs, *).

    First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.


    Possibly.

    My monitor has HDR, sorta, but I ended up not using it, as its effects
    were mostly, it seems:
      Makes the image brighter in general
        (like it turns up the effective brightness setting);
      Adds ringing artifacts around edges;
      Causes the screen image to flicker every few minutes or so.
        Very annoying, like the screen will just go black for a few seconds,
        often once every few minutes.

    Say, for example, with HDR turned on, a sudden sharp transition between
    Red and Green or similar will result in an ugly black line and ringing artifacts.

    Otherwise, if I wanted my monitor brighter, I could turn up the
    brightness level some more (I have it at a level that it doesn't burn my eyes).

    Kinda doesn't look great so not really worth it (vs leaving monitor in
    LDR mode).

    Seems more like a kind of gimmick.



    The more useful form of HDR IME is to use floating-point rendering and
    then render this out to LDR based on whatever is the current "exposure
    level" in the 3D rendering or similar.
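
    A hedged sketch of that render-in-float, map-to-LDR step (the
    Reinhard-style curve and the gamma value are common choices picked for
    illustration, not necessarily what BGB's renderer does):

        #include <math.h>
        #include <stdint.h>

        /* Map a linear HDR channel value to an 8-bit LDR channel, given
           the scene's current exposure level. */
        static uint8_t tonemap_channel(float hdr, float exposure)
        {
            float v = hdr * exposure;    /* apply current exposure level */
            v = v / (1.0f + v);          /* compress [0,inf) into [0,1)  */
            v = powf(v, 1.0f / 2.2f);    /* rough display gamma          */
            if (v < 0.0f) v = 0.0f;
            if (v > 1.0f) v = 1.0f;
            return (uint8_t)(v * 255.0f + 0.5f);
        }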



    Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
    imagery, the better the quality of the output.


    Possibly, but here we still don't usually need much more than RGB24 or similar.

    Likewise, FP8U (E4.M4) is maybe pushing it a little on the low-end for
    HDR, but basically works.
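
    For readers unfamiliar with 8-bit minifloats, a sketch of how an
    unsigned E4.M4 value might decode (the bias of 7, the implicit leading
    1, and the subnormal handling are assumptions made for illustration,
    not necessarily BGB's actual FP8U format):

        #include <math.h>
        #include <stdint.h>

        /* Hypothetical decode of an unsigned 8-bit float: 4 exponent bits,
           4 mantissa bits, no sign bit. */
        static float fp8u_e4m4_to_float(uint8_t v)
        {
            int e = (v >> 4) & 0xF;                    /* exponent field   */
            int m =  v       & 0xF;                    /* mantissa field   */
            if (e == 0)                                /* assumed subnormal */
                return ldexpf((float)m / 16.0f, 1 - 7);
            return ldexpf(1.0f + (float)m / 16.0f, e - 7);
        }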


    Meanwhile, a lot of late 1990s or early 2000s GPUs were like, "you are
    gonna take RGB555 and you are gonna like it".

    Like, say, I suspect the "Mobility Radeon 9000" in my older laptop is
    probably internally using RGB555 or RGBA4444 for textures (also probably
    with a 12|16 bit Z-Buffer), and reduced precision transform, ...

    Though, it predates having support for shaders. Also a weird quirk that
    if you encode DXT5 but then use the transparent-endpoint ordering from
    DXT1, the block seems to decode as it would in DXT1 (so, for DXT5, it
    needs to always use the opaque ordering to decode correctly).




    Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)


    Overkill is overkill.


    They can just be happy that in these modern times we are (mostly) free
    of indexed color and 16-color.

    Actually kinda hard to do non-terrible graphics in 16 color. Also, one
    may have to give up one of the colors for transparency. Typically, I had
    used hi-magenta as transparent color.

    But, sometimes, 16 colors is all you need.

    Like, behold, a video of a game from 2024 (Crimson Diamond):
    https://www.youtube.com/watch?v=3kOrATKd_Mc
    Where the whole game uses 16-color graphics...



    Had noted though that for me, IRL, monitors can't really represent real
    life colors. Like, I live in a world where computer displays all have a
    slight tint (with a similar tint and color distortion also applying to
    the output of color laser printers; and a different color distortion for
    inkjet printers).

    That is always true; “white” is never truly “white”, which is why those
    who work in colour always talk about a “white point” for defining what is meant by “white”, which is the colour of a perfect “black body” emitter at
    a specific temperature (typically 5500K or above).

    Yeah.

    Indoor lights typically come in "warm white" and "cool white". I usually prefer "cool white", but "warm white" is more common.

    Some people go on about how good LED looks vs CFL, but I actually
    slightly prefer the look of CFL. Both types of lighting screw up the
    colors, so it is a choice of which is "better", and in my case, I more
    lean towards CFL and fluorescent.

    Incandescent beat both of them, as did halogen (when the UV filter is in place, *). But, alas, pretty much no one uses halogen for indoor
    lighting, so at this point it is just sort of a choice between LED and fluorescent because people have gone and taken the incandescent bulbs away.

    *: Halogen looks good with a uv filter, and kinda terrible without a UV filter.


    Pretty much no one else seems to notice though, alas...

    Well, at this rate, maybe people will start trying to light their houses
    with magenta grow lights, "looks white enough to me...".

    Same basic issue, annoying when one seems to be the only person around
    that sees something.

    Or, basically, this issue: https://en.wikipedia.org/wiki/File:Led_grown_lights_useful.jpg

    Except nearly all the new LED bulbs are kinda like this (albeit not
    quite as extreme) and I am displeased. Like, man, I am not a plant, I
    don't need to live under grow lights.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Oct 17 22:56:59 2025
    From Newsgroup: comp.arch

    On 10/17/2025 9:29 PM, Robert Finch wrote:
    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.
    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.
    Bits beyond 8 would be for some sea creatures or viewable with special glasses?


    I don't think I can see much beyond 7 or 8 bits.

    I can see the banding artifacts from RGB555.
    I can slightly see the difference between LCD and a VGA CRT.
    On an LCD, the dithering causes a slightly "gritty" look with gradients
    that is absent with a CRT. Mostly, the banding or grit isn't enough to
    be worth caring about.

    But, RGB555 is still a big step up from indexed color, and "almost
    mostly good enough", except when one needs an alpha channel or HDR (then
    it kinda falls on its face).

    Though, I had often still used a format that is either RGB555 or
    RGB444_A3, as often good enough (has 5-bits/channel when opaque, or 4
    bits when translucent, per pixel).
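
    For concreteness, a generic RGB555 pack/unpack sketch (the common
    0RRRRRGGGGGBBBBB layout is assumed; the RGB444_A3 variant's exact bit
    assignment isn't given in the post, so it is not shown):

        #include <stdint.h>

        /* Pack 8-bit channels down to 5 bits each, and rescale 5-bit
           channels back up to 0..255 on unpack. */
        static uint16_t rgb24_to_rgb555(uint8_t r, uint8_t g, uint8_t b)
        {
            return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
        }

        static void rgb555_to_rgb24(uint16_t p,
                                    uint8_t *r, uint8_t *g, uint8_t *b)
        {
            *r = (uint8_t)((((p >> 10) & 0x1F) * 255 + 15) / 31);
            *g = (uint8_t)((((p >>  5) & 0x1F) * 255 + 15) / 31);
            *b = (uint8_t)((( p        & 0x1F) * 255 + 15) / 31);
        }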



    Bigger annoyance to me is a "tint" that permeates pretty much all of the
    artificial displays, and that the newer LED bulbs have also adopted.

    Like, sort of a color that is like blue + yellow rather than a true
    white. There is no way to get rid of this tint, as it is as if the tint
    were somehow itself a part of the RGB colorspace.

    We didn't really have this issue with CFLs.

    But, with monitors, I am at least mostly used to it; would prefer not to
    have the real-world tinted as well though.


    Had noted, inkjet printers don't have this particular issue though...
    They instead have the issue that nearly all the colors they print are
    biased towards looking crap-brown. Like, even if you use it to print a
    solid magenta, it still somehow has a crap-brown tinge to it (and only
    the plain white of non-printed parts of the page avoid this).

    Though, casually looking at it, it may not be obvious in isolation. It
    is more obvious if looking through a phone though: If the colors on a
    printed page change drastically between the real-life page, and the
    image seen through a phone screen, it is usually inkjet (and, if not, it
    is color laser).

    I suspect it is partly because color laser (along with many plastic
    products) has the same sort of green+blue cyan color often seen on
    monitors and similar.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 23:22:37 2025
    From Newsgroup: comp.arch

    On 10/17/2025 4:52 PM, EricP wrote:
    Stephen Fuld wrote:
    On 10/17/2025 12:43 PM, BGB wrote:
    On 10/17/2025 1:49 PM, Stephen Fuld wrote:

    As I am sure others will verify, the compatible descendants of the
    S/360 are alive in real hardware.  While I expect there haven't been any "new name" customers in a long time, the fact that IBM still
    introduces new chips every few years indicates that there is still a
    market for this architecture, presumably by existing customer's
    existing workload growth, and perhaps new applications related to
    existing ones.


    OK.

    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before.  I wonder where it came
    from?

    The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of. Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary
    CPUs with it.

    BTW, with the AS/400, Power didn't emulate the older S/38 CPU. AS/400
    is unusual in having lots of its functionality done in software, so IBM
    "just" ported that software to Power. For the other stuff, there was a
    sort of emulation layer, but the first time a program was run, it got
    silently recompiled to target the new architecture. Or something like
    that.


    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The stages of the
    z15 core (2019) doesn't look anything like Power10 (2021).

    Right.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Fri Oct 17 23:44:01 2025
    From Newsgroup: comp.arch

    On 10/17/2025 5:37 PM, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    See Univac 1108

    I am not sure what you are saying here. While the 1108 did have some
    characteristics of RISC, such as fixed-length instructions, it had some
    decidedly non-RISCy features such as mem+op instructions, optional
    indirect memory addressing, and some instructions that could search
    multiple memory locations. Its register architecture was a little odd,
    but it wasn't small. There were essentially about 40 user registers,
    though some (16) were arithmetic-only and some (15) memory-address-only
    (sort of like the Motorola 68K), but four of those actually "overlapped"
    the arithmetic registers so could be used for either, and some (15)
    could only store data.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:44:06 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 22:16:52 -0500, BGB wrote:

    On 10/17/2025 5:20 PM, Lawrence D’Oliveiro wrote:

    Probably better for vector instructions, where one sequence of operand
    type setup lets it then chug away to process a whole sequence of
    operand tuples in exactly the same way.

    Yeah, but this works assuming that your vector ops are primarily mapped
    to long-running loops.

    Maybe not. I recall in the Cray docs somewhere, that the break-even point
    for vector operations was as small as a vector size of 2. That is, if you
    had just two operand tuples, it was worth it to go through the vector- operation setup, instead of doing two sets of scalar operations.

    So RISC-V probably takes a bit more setup with the additional
    specification of operand types. But I suspect that will not move the break-even point up by, say, dozens of elements; probably only needs a few more elements to make it worthwhile.

    Maybe that was just a software thing: the Cray machines had their own
    architecture(s), which was never carried forward to the new massively-
    parallel supers, or RISC machines etc. Maybe the parallelism was
    thought to render deep pipelines obsolete -- at least in the early
    years. (*Cough* Pentium 4 *Cough*)

    I think they were also mostly intended for CFD and FEM simulations and similar, or stuff that is very regular (running the same math over a
    whole lot of elements).

    Also code breaking by Government spooks. There is a story of some guy in a presentation by Cray, who stood up at the back and stressed the importance
    of having population-count instructions, while refusing to go into detail about what he would use them for or even who he was.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 06:46:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be the changes relating to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which make sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    (According to the Fortran standard, the compiler has to honor
    parentheses).
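
    For comparison, a hedged note on the C side (not from the post above):
    in C, parentheses alone do not stop a compiler from contracting a
    multiply and an add into an FMA; the standard knob is C99's
    FP_CONTRACT pragma. A minimal sketch, with a made-up function that
    just mirrors one of the LAPACK lines:

        /* Ask the compiler not to contract a*b + c into an FMA, so each
           multiply and add is rounded separately. The default state of
           this pragma is implementation-defined. */
        #pragma STDC FP_CONTRACT OFF

        double rotate_term(double bb, double cs, double dd, double sn)
        {
            return (bb * cs) + (dd * sn);
        }
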
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:46:44 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.

    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of
    that in a movie, and you can see why the video images will need more than 8-by-8-by-8 dynamic range.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 06:54:26 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 23:22:37 -0700, Stephen Fuld wrote:

    BTW, with the AS/400, power didn't emulate the older S/38 CPU. AS/400
    is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power.

    There was some custom microcode added to the POWER chips specifically for
    the iSeries machines. I remember seeing a YouTube video where the
    presenter tried to make sense of some disassembled machine code -- it was mostly recognizable as POWER instructions, but the extras were not
    documented publicly anywhere.

    Might have been one of the videos on this channel <https://www.youtube.com/@MatthewMainframes/videos>, but I’m not sure.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 06:58:19 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> schrieb:
    On 17/10/2025 08:48, Lawrence D’Oliveiro wrote:
    On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.

    Is this why C99 (and later) has the “restrict” qualifier
    <https://en.cppreference.com/w/c/language/restrict.html>?

    "restrict" can significantly improve non-vectored code too, as well as
    more "ad-hoc" vectoring of code where the compiler uses general-purpose registers, but interlaces loads, stores and operations to improve pipelining. But it is certainly a very useful qualifier for vector code.

    You can apply it to arguments, but then you cannot use other
    pointers as "shorthand", so

    void foo(int *restrict a)
    {
        int *restrict b = a;
        // Do something with b
    }

    is undefined.
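
    A hedged aside, based on the example in the C standard rather than on
    the post: a restrict-to-restrict copy is only defined when the copy
    lives in a block nested inside the original pointer's block, so an
    inner block (or an unqualified alias) is the usual workaround.

        /* Modeled loosely on the C11 6.7.3.1 example: restrict-to-restrict
           copies are only defined "outer block to inner block". */
        void bar(int n, int * restrict a)
        {
            {
                int * restrict b = a;  /* OK: b's block is nested in a's */
                for (int i = 0; i < n; i++)
                    b[i] = 0;
            }
            /* int * restrict c = a;  same block as a: undefined, as above */
        }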

    Fortran has it simpler: Arguments cannot alias each other, or
    things from COMMON blocks, or ... unless explicitly declared TARGET.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:05:41 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <[email protected]> posted:
    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    1/x = RSQRT(x)*RSQRT(x), also just one FMUL

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.
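
    To make the building-block idea concrete, a small sketch (the seed is
    assumed to come from a hardware estimate or a lookup table, which is
    left abstract here; the identities are the ones listed above):

        /* One Newton-Raphson step refines an approximate rsqrt seed y0
           for x:  y1 = y0 * (1.5 - 0.5*x*y0*y0).
           sqrt(x) and 1/x then follow with multiplies only. */
        static double rsqrt_step(double x, double y0)
        {
            return y0 * (1.5 - 0.5 * x * y0 * y0);
        }

        static double sqrt_via_rsqrt(double x, double r) /* SQRT(x) = RSQRT(x)*x */
        {
            return r * x;
        }

        static double recip_via_rsqrt(double r)          /* 1/x = RSQRT(x)^2 */
        {
            return r * r;
        }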

    Your last example is where I got involved with the issue: a computational
    fluid chemistry researcher from Sweden reached out; he wanted to speed
    up Sqrt(), which he believed to be the bottleneck when calculating the reciprocal distance for all his chemical force estimates.

    After looking at his source code, it was obvious that by directly
    calculating 1/sqrt(sum of squares), the speedup would be much more significant.

    In the end I created a function which calculated three RSqrt() values in
    parallel; this was by far the most common use case for any reaction
    taking place in an H2O solution, and it allowed almost all the latency
    delays to be overlapped between the three copies of the pipeline.

    In the end, his week-long simulations (running on Alpha and PentiumPro
    cpus) ran in exactly half the time so now he could double the number of
    runs.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:21:32 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:
    Short-vector SIMD was introduced along an entirely separate evolutionary
    path, namely that of bringing DSP-style operations into general-purpose
    CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of typically 8-
    and 16-bit elements; it was the enabler for SW DVD decoding. ZoranDVD
    was the first to properly handle 30 frames/second with zero skips, and it
    needed a PentiumMMX-200 to do so.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:25:16 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/17/2025 4:52 PM, EricP wrote:
    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of.  Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary CPUs with it.

    BTW, with the AS/400, power didn't emulate the older S/38 CPU.  AS/400
    is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power.  For the other stuff, while there
    was a sort of emulation layer, but the first time a program was run, it
    got silently recompiled to target the new architecture.  Or something
    like that.
    I consider AS/400 to be the blueprint for Mill's choice to have a model-portable distribution format that goes through the specializer in
    order to be compatible with the actual CPU model it is now running on.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Oct 18 10:33:13 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits
    beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.
    10 million is more than what I've heard/seen, but OK:
    More interesting is the fact that females tend to have about 10x the
    ability to distinguish colors compared to men, due to the fact that the blue-green receptors are tied to the X chromosome, and they don't have
    to be exactly the same. I know this is true for my wife and me, but on
    the other hand I have much better monochrome vision so I can see better
    when it is quite dark.

    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of that in a movie, and you can see why the video images will need more than 8-by-8-by-8 dynamic range.
    In reality they don't even (really) try. :-)
    Many years ago, they even had to shoot all night-time scenes during the
    day because the film and cameras didn't have nearly enough dynamic range.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Oct 18 08:27:14 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    Where is there an architecture you would class as "RISC", but did not have
    a "large" register set?

    (How "large" is "large"? The VAX had 16 registers; was there any RISC architecture with only that few?)

    The first IBM 801 has 16 registers. ARM A32/T32 has 16 registers (and
    shares the VAX's mistake of making the PC accessible as GPR). RV32E
    (and, I think, RV64E) has 16 registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sat Oct 18 04:46:43 2025
    From Newsgroup: comp.arch

    On 10/18/2025 3:33 AM, Terje Mathisen wrote:
    Lawrence D’Oliveiro wrote:
    On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:

    I do not understand why monitor would go beyond 9-bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits. Bits
    beyond 8 would be for some sea creatures or viewable with special
    glasses?

    Under ideal conditions (comparing large areas), the human eye can
    distinguish about 10 million colours. Round that up to 2**24, and you get
    the traditional 8-by-8-by-8 RGB “full colour” space.

    10 million is more than what I've heard/seen, but OK:

    More interesting is the fact that females tend to have about 10x the
    ability to distinguish colors compared to men, due to the fact that the blue-green receptors are tied to the X chromosome, and they don't have
    to be exactly the same. I know this is true for my wife and me, but on
    the other hand I have much better monochrome vision so I can see better
    when it is quite dark.


    I seem to have a quirk that I see best in dim conditions; but am nearly blinded in direct sunlight (but, can see better with shade-4 or shade-5 glasses).

    For me, shade-5 seems to work best for daytime conditions:
    Shade 4 isn't quite dark enough;
    Shade 7 is too dark (difficult to see effectively with shade 7).

    Hard to find shade-5 glasses that aren't strongly tinted (usually
    green), found some that (merely) have a yellow tint, still better than monochromatic green.

    Though, despite some drawbacks (like being mostly monochromatic green),
    some shade-5 welding goggles are otherwise pretty effective at defeating
    the sun (and not letting light in from the side). Some dark sunglasses
    still have more of an issue with light leakage (where if light leaks in
    from the side and bounces off the inside of the lens, this isn't ideal
    for visibility). But, then it is also hard finding shade-5 glasses that
    aren't green, so, ...

    Most of the normal sunglasses aren't really dark enough (they need to be
    dark enough to be effective).




    Had noted, the outdoor conditions where I see best (which ironically
    looks the most like typical images of daytime conditions) is after the
    sun has set, but before it has gotten dark.

    So, say: Real daytime: nearly everything that the sun hits is covered in
    a white haze.

    Full night-time conditions are still dark though.
    So, alas, still no ability to see particularly well in night-time
    conditions either.



    However, consider your eye’s ability to adapt to a dynamic range from a
    dim room out into bright sunlight. Now imagine trying to simulate some of
    that in a movie, and you can see why the video images will need more than
    8-by-8-by-8 dynamic range.

    In reality they don't even (really) try. :-)


    Yes.


    Many years ago, they even had to shoot all night-time scenes during the
    day because the film and cameras didn't have nearly enough dynamic range.


    In the conditions I see best, things like my cellphone camera have a
    hard time taking good pictures (the images are dark and often have
    significant noise).

    Like, my room is at an OK light level for me, but phone sees it all as
    dark and grainy. Room is lit by a CFL bulb in an overhead holder
    (current bulb is 50W equivalent IIRC).



    Seemingly one has to put something under a bright lamp (uncomfortably
    bright) before the phone camera can get an image that isn't dark and noisy.

    I can still see OK in situations where phone cameras just mostly give an
    all black image.

    However, the phone camera can see things better in brightly lit
    conditions than I can.


    Though, sometimes extra light can help: for example, although a little
    unpleasantly bright, using a lamp for things like soldering can be
    helpful (say, a lamp with a 40W-equivalent CFL bulb).


    I once had a sort of mini desk lamp lit by a smaller bulb; I don't
    have it now. But a lot of similar bulbs exist on Amazon.

    I don't see many with the same type of design, but the 2.5W E10 bulbs
    appear to be a similar category, and are fairly readily available.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Sat Oct 18 07:31:54 2025
    From Newsgroup: comp.arch

    On 10/18/2025 1:25 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 10/17/2025 4:52 PM, EricP wrote:
    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    Yes, sort of.  Perhaps because IBM replaced the AS/400 with power,
    someone assumed (incorrectly) that they replaced all their proprietary
    CPUs with it.

    BTW, with the AS/400, Power didn't emulate the older S/38 CPU.  AS/400
    is unusual in having lots of its functionality done in software, so
    IBM "just" ported that software to Power.  For the other stuff, there
    was a sort of emulation layer, but the first time a program was run,
    it got silently recompiled to target the new architecture.  Or
    something like that.

    I consider AS/400 to be the blueprint for Mill's choice to have a
    model-portable distribution format that goes through the specializer
    in order to be compatible with the actual CPU model it is now running on.

    Absolutely. I think Ivan has even said so.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sat Oct 18 17:16:00 2025
    From Newsgroup: comp.arch

    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a
    much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic. So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range. Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.
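
    As a concrete illustration of one widely used non-linear encoding, the
    standard sRGB transfer curve spends more of the 8-bit code values on
    dark tones than a linear scale would (standard constants; just a
    generic sketch, not anything from this thread):

      /* Encode a linear-light value in [0,1] to an 8-bit sRGB code. */
      #include <math.h>
      #include <stdint.h>

      static uint8_t srgb_encode8(double linear)
      {
          double v;
          if (linear <= 0.0031308)
              v = 12.92 * linear;                          /* linear toe */
          else
              v = 1.055 * pow(linear, 1.0 / 2.4) - 0.055;  /* gamma segment */
          if (v < 0.0) v = 0.0;
          if (v > 1.0) v = 1.0;
          return (uint8_t)(v * 255.0 + 0.5);
      }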


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sat Oct 18 13:16:17 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:54:23 GMT, MitchAlsup
    <[email protected]d> wrote:


    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how
    to make the (17 kinds of) hammers one needs, there is little need to
    make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been
    happier... The mouse was more precise in W7 than in W8 ... With a little
    upgrade for new PCIe architecture along the way rather than redesigning
    the whole kit and caboodle for tablets and phones which did not work
    BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by
    demand for the product, but pushed by companies to make already
    satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there is now enough
    well trained SW engineers to tackle the undone SW backlog.

    The problem is that decades of "New & Improved" consumer products have
    conditioned the public to expect innovation (at minimum new packaging
    and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by many
    current developers due to concern that the project has been abandoned.
    That perception likely would not change even if the author(s)
    responded to inquiries, the library was suitable "as is" for the
    intended use, and the lack of recent updates can be explained entirely
    by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FFT, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.


    I agree completely! However, numeric libraries are not what the
    average developer is looking for. For every 1 looking for a numerics
    library, there are 100,000 looking for some kind of web function,
    editing, data interchange, or database library.


    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    Yeah, that happens too.


    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project somewhere
    else that still is actively under development. Even if it's buggy and
    unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative approach.

    YMMV but, as a software developer myself, this attitude makes me sick.
    8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.

    And that also. Clearly if the economics of the <whatsit> changes, you
    have to re-evaluate using it.

    Company I worked for had a handful of Sparc 5s running Solaris. We
    only used them in connection with a board-level debugger which we needed
    for developing some embedded projects running VxWorks on 68K VME. The
    Sparcs monitored the VME module and allowed replaying system level
    events to figure out what led to <whatever was going on>.

    I overheard the manager complaining that we could buy 3-4 top-of-the-line
    Pentium workstations for the cost of each Sparc. Unfortunately - at
    that time - the debugger/monitor software didn't run on x86. A few
    years later there was an x86 version introduced, but, by that time, we
    weren't doing anything that needed it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 20:25:08 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 08:27:14 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> writes:
    Where is there an architecture you would class as "RISC", but did
    not have a "large" register set?

    (How "large" is "large"? The VAX had 16 registers; was there any
    RISC architecture with only that few?)

    The first IBM 801 has 16 registers. ARM A32/T32 has 16 registers (and
    shares the VAX's mistake of making the PC accessible as GPR). RV32E
    (and, I think, RV64E) has 16 registers.

    - anton

    I wouldn't count the 801, because it was a concept rather than a
    production CPU. But ROMP does count. Not a success, but a product
    nevertheless. Another (apart from ARM) successful RISC with a small
    register file is Hitachi SH.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 21:42:17 2025
    From Newsgroup: comp.arch

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:

    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way
    rather than redesigning whole kit and caboodle for tablets and
    phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there is now enough well trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPAC has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that LAPACK API was not updated in decades, although I'd
    expect that even at API level there were at least small additions, if
    not changes. But if you are right that LAPACK implementation was not
    updated in decades, then you could be sure that it is either not used by
    anybody or used by very few people.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS. Both libraries
    are not just updated, but more like permanently re-written.
    I'm pretty sure that the same applies to Apple's implementations
    of BLAS and LAPACK.
    And, of course, the same applies to the GPGPU implementations, both
    from NV and from AMD and more recently from Intel as well.

    Most Floating Point Libraries are in a similar position. They were
    updated after IEEE 754 became widespread and are as good today as
    ever.

    {FF1, Tomography, CFD, FEM} have needed no real changes in decades.

    Sometimes, Software is "done". You may add things to the package
    {like a new crescent wrench} but the old hammer works just as well
    today as 30 years ago when you bought it.


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Why take a chance?

    On the last day of SW support for W10--they (THEY) updated several
    things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!

    To the SW vendor, they want to be able to update their SW any time
    they want. Yet, the application user wants the same bugs to remain
    constant over the duration of the WHOLE FRIGGEN project--because
    once you found them and figured a way around them, you don't want
    them to reappear somewhere else !!!

    There simply _must_ be a similar project
    somewhere else that still is actively under development. Even if
    it's buggy and unfinished, at least someone is working on it.

    I understand--but this bites more often than the conservative
    approach.
    YMMV but, as a software developer myself, this attitude makes me
    sick. 8-(

    I was in a 3-year project where we had to forgo upgrading from SunOS
    to Solaris because the SW license model changes would have put us out
    of business before project completion.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Oct 18 19:24:21 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to choose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions, if
    not changes. But if you are right that LAPACK implementation was not
    updated in decades, then you could be sure that it is either not used
    by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R. I learned about R the hard way, when a wrong interface
    in the C bindings of Lapack surfaced after a long, long time.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    I agree. There is a _lot_ of active research in numerical
    algorithms, be it for ODE systems, sparse linear solvers or whatnot.
    A lot of that is happening in Julia, actually.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Sat Oct 18 19:36:17 2025
    From Newsgroup: comp.arch

    According to EricP <[email protected]>:
    I had thought it was the idea that IBM kept running the original ISA,
    but as an emulation layer on top of POWER rather than as the real
    hardware level ISA.

    I have heard that idea several times before. I wonder where it came from?

    The AS400 cpu was replaced by Power and an emulation layer.
    https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC

    The S/38 and AS/400 had a virtual instruction set called TIMI which
    is translated into native code the first time a program is run.
    They didn't write an emulation layer. They just wrote a new
    translator to POWER rather than to the previous low level architecture.

    I gather most phones do the same thing, translating JVM or ART code to
    native code when it installs an app.

    The z-series was always a different cpu, but maybe they
    shared development groups with Power. The pipeline stages of the
    z15 core (2019) don't look anything like those of Power10 (2021).

    https://www.servethehome.com/wp-content/uploads/2020/08/Hot-Chips-32-IBM-Z15-Processor-Pipeline.jpg

    I would expect them to be different since z has to run S/360 code which is rather different from
    POWER code.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Oct 18 23:11:38 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to chose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions,
    if not changes. But if you are right that LAPACK implementation was
    not updated in decades, then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Are Python (numpy and scipy, I suppose) or R linked against an
    implementation of LAPACK from 30 or 40 years ago, as suggested by
    Mitch? Somehow, I don't believe it.
    I don't use either of the two for numerics (I use Python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses
    relatively new implementations, and I'm pretty sure that the same goes
    for Matlab.


    I learned about R the hard way, when a wrong
    interface in the C bindings of Lapack surfaced after a long, long
    time.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.
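
    For illustration, the kind of "simple algorithm" meant here -- a
    textbook unblocked Cholesky factorization, easy to write directly for
    small or medium n (a generic sketch, not the actual code referred to
    above):

      /* Factor a symmetric positive-definite n*n matrix (row-major) as
         A = L*L^T, overwriting the lower triangle of 'a' with L.
         Returns 0 on success, -1 if A is not positive definite. */
      #include <math.h>

      int cholesky(double *a, int n)
      {
          for (int j = 0; j < n; j++) {
              double d = a[j * n + j];
              for (int k = 0; k < j; k++)
                  d -= a[j * n + k] * a[j * n + k];
              if (d <= 0.0)
                  return -1;
              a[j * n + j] = sqrt(d);
              for (int i = j + 1; i < n; i++) {
                  double s = a[i * n + j];
                  for (int k = 0; k < j; k++)
                      s -= a[i * n + k] * a[j * n + k];
                  a[i * n + j] = s / a[j * n + j];
              }
          }
          return 0;
      }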


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    I agree. There is a _lot_ of active research in numerical
    algorithms, be it for ODE systems, sparse linear solvers or whatnot.
    A lot of that is happening in Julia, actually.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Sat Oct 18 22:10:35 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro <[email protected]d> wrote:
    On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:

    On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
    Lawrence D’Oliveiro <[email protected]d> wrote:

    From the beginning, I felt that the much-trumpeted reduction in
    instruction set complexity never quite matched up with reality. So I
    thought a better name would be “IRSC”, as in “Increased Register Set
    Computer” -- because the one feature that really did become common was
    the larger register sets.

    Larger register sets were common, but not universal.

    Where is there an architecture you would class as “RISC”, but did not have
    a “large” register set?

    (How “large” is “large”? The VAX had 16 registers; was there any RISC
    architecture with only that few?)

    Cortex M0 has only 8 general purpose registers. There are 8 other
    ARM registers, but on Cortex M0 they can be used only by selected
    instructions.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sat Oct 18 22:20:14 2025
    From Newsgroup: comp.arch

    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend
    “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Sat Oct 18 22:22:32 2025
    From Newsgroup: comp.arch

    David Brown <[email protected]> wrote:
    On 16/10/2025 23:26, BGB wrote:
    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the “R” in “RISC”.

    The R in RISC-V comes from "student _R_esearch".

    “Reduced Instruction Set Computing”. That was what every single
    primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling
    up the instruction opcode space with a combinatorial explosion. (Or
    sequence of combinatorial explosions, when you look at the wave after
    wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on
    different hardware.  With SIMD, you need different code if your
    processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
    different instructions using different SIMD registers.  With the
    vector style instructions in RISC-V, the actual SIMD registers and
    implementation are not exposed to the ISA and you have the same code
    no matter how wide the actual execution units are.  I have no
    experience with this (or much experience with SIMD), but that seems
    like a big win to my mind.  It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar cpus,
    rather than Itanium EPIC coding.


    But, there is a problem:
    Once you go wider than 2 or 4 elements, cases where wider SIMD brings
    more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
      Vector Masking;
      Resource and energy costs of using wider vectors;
      ...


    I appreciate that. Often you will either be wanting the operations to
    be done on a small number of elements, or you will want to do it for a
    large block of N elements which may be determined at run-time. There
    are some algorithms, such as in cryptography, where you have sizeable
    but fixed-size blocks.

    When you are dealing with small, fixed-size vectors, x86-style SIMD can
    be fine - you can treat your four-element vectors as single objects to
    be loaded, passed around, and operated on. But when you have a large run-time count N, it gets a lot more inefficient. First you have to
    decide what SIMD extensions you are going to require from the target,
    and thus how wide your SIMD instructions will be - say, M elements.
    Then you need to loop N / M times, doing M elements at a time. Then you need to handle the remaining N % M elements - possibly using smaller
    SIMD operations, possibly doing them with serial instructions (noting
    that there might be different details in the implementation of SIMD and serial instructions, especially for floating point).
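
    For illustration, a minimal sketch of that pattern using SSE2
    intrinsics (so M = 4 ints per vector); the function and the use of
    unaligned loads are just assumptions for the example:

      #include <emmintrin.h>   /* SSE2 intrinsics */
      #include <stddef.h>

      /* dst[i] = a[i] + b[i]: the main loop does N/4 vector adds,
         the tail handles the remaining N%4 elements serially. */
      void add_arrays(int *dst, const int *a, const int *b, size_t n)
      {
          size_t i = 0;
          for (; i + 4 <= n; i += 4) {
              __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
              __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
              _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
          }
          for (; i < n; i++)
              dst[i] = a[i] + b[i];
      }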

    In many cases one can enlarge data structures to a multiple of the
    SIMD vector size (and align them properly). This requires some extra
    code, but not too much, and all of it is outside the inner loop. So,
    there is some waste due to unused elements, but it is rather small.

    Of course, there is still trouble due to different SIMD vector
    sizes and/or different SIMD instruction sets.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:08:58 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC), Thomas Koenig wrote:

    [LAPACK] is certainly in use by very many people, if indirectly, for
    example by Python or R.

    Certainly used by NumPy:

    ldo@theon:~> apt-cache depends python3-numpy
    python3-numpy
    ...
    |Depends: libblas3
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:11:37 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 23:11:38 +0300, Michael S wrote:

    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave
    uses relatively new implementations, and pretty sure that the same
    goes for Matlab.

    On my system, Octave uses exactly the same version of LAPACK as NumPy
    does:

    ldo@theon:~> apt-cache depends octave
    octave
    ...
    Depends: <libblas.so.3>
    libblas3
    libblis4-openmp
    libblis4-pthread
    libblis4-serial
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    |Depends: liblapack3
    Depends: <liblapack.so.3>
    liblapack3
    libopenblas0-openmp
    libopenblas0-pthread
    libopenblas0-serial
    ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:17:19 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@[email protected] to comp.arch on Sun Oct 19 01:20:16 2025
    From Newsgroup: comp.arch

    On Sat, 18 Oct 2025 22:22:32 -0000 (UTC), Waldek Hebisch wrote:

    In many cases one can enlarge data structures to a multiple of the SIMD
    vector size (and align them properly). This requires some extra code, but
    not too much, and all of it is outside the inner loop. So, there is some
    waste due to unused elements, but it is rather small.

    Of course, there is still trouble due to different SIMD vector sizes
    and/or different SIMD instruction sets.

    Just so long as you keep such optimized data structures *internal* to the program, and don’t make them part of any public interchange format!

    Interchange formats tend to outlive the original technological milieu they were created in, and decisions made for the sake of technical limitations
    of the time can end up looking rather ... anachronistic ... just a few
    years down the track.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Oct 19 01:56:03 2025
    From Newsgroup: comp.arch

    On 10/18/2025 10:16 AM, David Brown wrote:
    On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    First of all, we have some “HDR” monitors around now that can output a
    much greater gradation of brightness levels. These can be used to
    produce apparent brightnesses greater than 100%.

    It is unlikely that monitors will ever get much beyond 11-bits of pixel
    depth per color.

    I think bragging rights alone will see it grow beyond that. Look at
    tandem
    OLEDs.


    Like many things, human perception of brightness is not linear - it is somewhat logarithmic.  So even though we might not be able to
    distinguish anywhere close to 2000 different nuances of one primary
    colour, we /can/ perceive a very wide dynamic range.  Having a large
    number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.


    Possible, but it is a question whether high bit depth would make much
    difference. We are still in a situation where HDMI usually sends 8 or
    sometimes 10 bits per channel, but displays are generally limited to 5
    or 6 bits (and may then dither on the display side).


    Then we have:
      Traditional LCD: Uses a fluorescent backlight;
      LED: Typically LCD + LED backlights;
      OLED: Panel itself uses LEDs;
        Typically much more expensive;
        Notoriously short lifespan.

    I have a display, LED+LCD tech, it has an HDR mode, but it isn't great.
    As noted, it seems like it mostly turns up the brightness and uses image processing wonk (which adds a bunch of artifacts).

    And, if I wanted 25% brighter, I could turn the brightness setting from
    40 to 50 or similar (checks, current settings being 40% brightness, 60% contrast).



    Then, we have HDR in 3D rendering which is, as noted, not usually about
    the monitor, but about using floating-point for rendering (typically
    with LDR for the final output).

    Often it still makes sense to use LDR for textures, but then HDR for the framebuffer (since the HDR is usually more a product of the lighting
    than the materials).

    Binary16 is plenty of precision for framebuffer.
    Though, often FP8U (E4.M4) is likely to still be acceptable.

    Where:
    E3.M5: Not really enough dynamic range.
    E4.M4: OK (Comparable to RGB555)
    E5.M3: Image quality is poor (worse than RGB555).

    We usually give up sign with smaller formats, assuming that any values
    which would go negative are clamped to 0, as it is harder in this case
    to justify spending a bit on being able to represent negative colors.

    For native Binary16, may as well allow negatives.
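
    For illustration, one plausible reading of an FP8U (E4.M4) value
    decoded to float; the bias of 7, the implicit leading 1, and the
    denormal handling are assumptions for the sketch, not necessarily how
    the formats above define it:

      #include <math.h>
      #include <stdint.h>

      /* Unsigned minifloat: 4 exponent bits, 4 mantissa bits, no sign.
         Assumed bias = 7; exponent 0 treated as denormal. */
      static float fp8u_e4m4_to_float(uint8_t v)
      {
          int e = (v >> 4) & 0x0F;
          int m = v & 0x0F;
          if (e == 0)
              return ldexpf((float)m, 1 - 7 - 4);      /* 0.m * 2^(1-bias) */
          return ldexpf((float)(16 + m), e - 7 - 4);   /* 1.m * 2^(e-bias) */
      }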



    There is a question of the best way to store HDR images:
    4x FP16: High quality, but expensive
    4x FP8U: More affordable, can do RGBA
    RGB8_E8: good for opaque images, works OK.
    RGB8_EA4: OK, non-standard.
    RGB9_E5: Good for opaque images
    RG11_B10: E5.M6 | E5.M5

    For files, currently ignoring EXR, but this is typically similar tech to
    the TGA format in most cases (raw floats, or maybe with RLE, very
    bulky). There are other options, but when I encountered EXR images in
    the past, they were being used basically like the TGA format.


    For a format like my UPIC design, could likely (in theory) handle
    components of up to around 14 bits. Problem becomes the range of
    quantizer values, where at high bit-depths an 8-bit quantization table
    value may be no longer sufficient.

    In this case, the limiting factor is that A-B needs to stay within int16
    range (both the internal buffers and coefficient encoding maxes out at
    int16 range).

    For T.81 JPEG, there are rarely used variants that have 10- and 12-bit
    components (where JPEG has a lot of the same basic issues).
    Though, a lot of what people assume are the limits of T.81 JPEG are
    actually the limits of JFIF.


    With either format, using 12 bits makes sense, as this isn't too far
    outside the range of the 8-bit quantization values (this mostly sets a
    limit on how low a quality 0% can achieve; though it likely does mean
    scaling the quantizer values by 8x vs whatever they would be for that
    quality level with LDR, and clamping them between 1 and 255).


    So, one possibility could be, say:
    Image can represent values as 12 bits: E5.M7

    Or, maybe allow negative components as well, likely in ones' complement
    form. Though, this would be unusual if using JPEG as a base, as
    implementations tend not to use negative components, even if nothing
    in the design of the format necessarily prevents them.

    Depending on needs, could be decoded as Binary16 or as one of the other formats.

    Though, another option is to just store the images with 8-bit E4.M4
    components (so, from the codec's POV, it is the same as with an LDR image).



    Then again, someone might want lossless Binary16, but my UPIC format
    couldn't do this as-is, since doing so would exceed current value ranges.

    I would likely need to hack the VLC scheme to allow for larger coefficients.

    As-is, table looks like (V prefix, extra bits, unsigned range):
       0/ 1,  0,     0..    1      2/ 3,  0,     2..    3
       4/ 5,  1,     4..    7      6/ 7,  2,     8..   15
       8/ 9,  3,    16..   31     10/11,  4,    32..   63
      12/13,  5,    64..  127     14/15,  6,   128..  255
      16/17,  7,   256..  511     18/19,  8,   512.. 1023
      20/21,  9,  1024.. 2047     22/23, 10,  2048.. 4095
      24/25, 11,  4096.. 8191     26/27, 12,  8192..16383
      28/29, 13, 16384..32767     30/31, 14, 32768..65535

    So, with the zigzag folding, this expresses a 16-bit range.
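
    For reference, the usual zigzag fold/unfold between signed and
    unsigned values looks like the sketch below (one common convention;
    the exact mapping used here may differ):

      #include <stdint.h>

      /* 0,-1,1,-2,2,... <-> 0,1,2,3,4,... */
      static uint16_t zz_fold(int16_t v)
      {
          return (uint16_t)((v < 0) ? (-2 * (int)v - 1) : (2 * (int)v));
      }
      static int16_t zz_unfold(uint16_t u)
      {
          return (int16_t)((u & 1) ? -(int)((u + 1) >> 1) : (int)(u >> 1));
      }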

    Both the Block-Haar and RCT effectively cost 1 bit of dynamic range,
    meaning that, as-is, the widest allowed component is 14 bits (signed
    range).

    Though, one possibility would be hacking the upper end of the table (not otherwise used for LDR images) to use a steeper step with a 16-bit
    components range, say:
    24, 12, 4096.. 8191
    25, 13, 8192.. 16383
    26, 14, 16384.. 32767
    27, 15, 32768.. 65535
    28, 16, 65536.. 131071
    29, 17, 131072.. 262143
    30, 18, 262144.. 524287
    31, 19, 524288..1048575

    Which (if using 32-bits for transform coefficients) would exceed the
    dynamic range needed for 16-bit coefficients (roughly +/- 262144 if unbalanced).

    Might need to define a special case for 16-bit quantization tables to
    allow for effective lossy compression though. Most naive option is that,
    if the quantization table has 128 bytes of payload (vs 64) it is assumed
    to use 16-bit components.


    Well, and then one can debate whether RCT, Haar, etc, are still the best options. Well, and (if 12 bit components were used), how the VLC scheme
    would be understood (or if Binary16 would effectively preclude such a
    12-bit encoding scheme as redundant).


    May or may not have a use-case for such a thing, TBD.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Oct 19 07:55:57 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:
    On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    It is possible that LAPACK API was not updated in decades,

    The API of existing LAPACK routines was not changed (AFAIK),
    but there were certainly additions. It is also possible to chose
    64-bit integers at build time.

    although I'd
    expect that even at API level there were at least small additions,
    if not changes. But if you are right that LAPACK implementation was
    not updated in decades, then you could be sure that it is either not
    used by anybody or used by very few people.

    It is certainly in use by very many people, if indirectly, for example
    by Python or R.

    Does Python (numpy and scipy, I suppose) or R linked against
    implementation of LAPACK from 40 or 30 years ago, as suggested by Mitch?

    No, they don't (as I learned). They would cut themselves off
    from all the improvements and bug fixes since then.

    Somehow, I don't believe it.
    I don't use either of the two for numerics (I use python for other
    tasks). But I use Matlab and Octave. I know for sure that Octave uses relatively new implementations, and pretty sure that the same goes
    for Matlab.

    I would be surprised otherwise.

    Personally, when I need LAPACK-like functionality then I tend to use
    BLAS routines either from Intel MKL or from OpenBLAS.

    Different level of application. You use LAPACK when you want to do
    things like calculating eigenvalues or singular value decomposition,
    see https://www.netlib.org/lapack/lug/node19.html . If you use
    BLAS directly, you might want to check if there is a routine
    in LAPACK which does what you need to do.

    Higher-level algos I am interested in are mostly our own inventions.
    I can look, of course, but the chances that they are present in LAPACK
    are very low.
    In fact, even BLAS L3 I don't use all that often (and lower levels
    of BLAS never).
    Not because the APIs do not match my needs. They typically do. But
    because standard implementations are optimized for big or huge matrices.
    My needs are medium matrices. A lot of medium matrices.
    My own implementations of standard algorithms for medium-sized
    matrices, most importantly of Cholesky decomposition, tend to be much
    faster than those in OTS BLAS libraries. And preparation of my own
    didn't take a lot of time. After all, those are simple algorithms.

    For the same reason, I implemented unrolling of MATMUL for small
    matrices in gfortran a few years ago. If all you are doing are
    small matrices (especially of constant size), the compiler can
    do a better job from a straight loop. By the time the optimized
    matmul routines have started up their machinery, the calculation
    is already done.

    I had to be careful about benchmarking, though. I had to hide from the
    compiler the fact that I was not actually using the results;
    otherwise I got extremely fast execution times for what was
    essentially a no-op. My standard method now is to select a pair
    of array indices where the compiler cannot see them (read from
    a string) and then write out a single element at that position,
    also to a string.
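
    The same trick, sketched in C rather than Fortran (hypothetical names;
    the point is only that the indices and the single output element pass
    through strings the optimizer cannot see through):

      #include <stdio.h>

      #define N 4
      static double a[N][N], b[N][N], c[N][N];
      static char idx_str[] = "2 3";  /* indices the compiler cannot constant-fold */
      static char out_str[64];        /* sink that keeps the result observably used */

      static void matmul(void)        /* the code under test */
      {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  double s = 0.0;
                  for (int k = 0; k < N; k++)
                      s += a[i][k] * b[k][j];
                  c[i][j] = s;
              }
      }

      void bench_once(void)
      {
          int i, j;
          sscanf(idx_str, "%d %d", &i, &j);
          matmul();
          snprintf(out_str, sizeof out_str, "%g", c[i][j]);
      }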
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sun Oct 19 16:52:12 2025
    From Newsgroup: comp.arch

    Lawrence D’Oliveiro wrote:
    Speaking of Cray, the US Mint are issuing some new $1 coins featuring
    various famous persons/things, and one of them has a depiction of the
    Cray-1 on it.

    From the photo I’ve seen, it’s an overhead view, looking like a
    stylized letter C. So I wonder, even with the accompanying legend “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a
    picture of the computer?

    <https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
    My guess: Well below 0.1% unless they get told what it is.
    It was not obvious to me, and I have sat on the Cray bench several
    times, both in Trondheim (in active use at the time) and in the
    Computer History Museum in Silicon Valley many years later. (Maybe the
    latter is a faulty recollection, and I only got to look at it at that
    time? It was during a private showing of the collection.)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:31:50 2025
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.

    I do not understand why monitors would go beyond 9 bits. Most people
    can't see beyond 7 or 8-bits color component depth. Keeping the
    component depth 10-bits or less allows colors to fit into 32-bits.

    My point was that there is a physical limit on how closely one can
    illuminate a colored pixel--and that limit is around 11-bits. Just
    like there is a limit on how good one can make an A/D converter which
    is around 22-bits.

    I did not imply that a person could SEE that fine a granularity, just
    that one could build a screen that had that fine a granularity.

    Bits beyond 8 would be for some sea creatures or viewable with special glasses?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:37:03 2025
    From Newsgroup: comp.arch


    Thomas Koenig <[email protected]> posted:

    MitchAlsup <[email protected]d> schrieb:

    LAPAC has not been updated in decades, yet is as relevant today as
    the first day it was available.

    Lapack's basics have not changed, but it is still actively maintained,
    with errors being fixed and new features added.

    If you look at the most recent major release, you will see that a lot
    is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
    One important thing seems to be changes to 64-bit integers.

    And I love changes like

    - B = BB*CS + DD*SN
    - C = -AA*SN + CC*CS
    + B = ( BB*CS ) + ( DD*SN )
    + C = -( AA*SN ) + ( CC*CS )

    which makes sure that compilers don't emit FMA instructions and
    change rounding (which, apparently, reduced accuracy enormously
    for one routine).

    FFT is sensitive to NOT using FMAC--that is, the error across
    butterflies is lower with FMUL, FMUL and FADD than with FMUL, FMAC.
    This has to do with distributing the error evenly, whereas FMAC
    makes one of the calculations better.

    (According to the Fortran standard, the compiler has to honor
    parentheses).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Oct 19 19:42:35 2025
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sun Oct 19 18:07:10 2025
    From Newsgroup: comp.arch

    On Sun, 19 Oct 2025 19:42:35 GMT, MitchAlsup
    <[email protected]d> wrote:


    Michael S <[email protected]> posted:

    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:


    No, old hammer does not work well. Unless you consider delivering
    5-10% of possible performance as "working well".

    Are you suggesting that a brand new #3 ball peen hammer is usefully
    better than a 30 YO #3 ball peen hammer ???

    With repeated use hammers become brittle. A 30yo hammer is more likely
    to crack and/or chip than is a new one.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Mon Oct 20 08:57:42 2025
    From Newsgroup: comp.arch

    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed. But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Mon Oct 20 11:06:08 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which was
    a nice bonus - especially if you wanted more than CD quality audio.
    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.
    It worked on MMX/SSE/SSE2/Altivec.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Mon Oct 20 14:21:14 2025
    From Newsgroup: comp.arch

    On 10/20/2025 4:06 AM, Terje Mathisen wrote:
    David Brown wrote:
    On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
    On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:

    MitchAlsup wrote:

    On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:

    Short-vector SIMD was introduced along an entirely separate
    evolutionary path, namely that of bringing DSP-style operations
    into general-purpose CPUs.

    MMX was designed to kill off the plug in Modems.

    MMX was quite obviously (also) intended for short vectors of
    typically 8 and 16-bit elements, it was the enabler for sw DVD
    decoding. ZoranDVD was the first to properly handle 30 frames/second
    with zero skips, it needed a PentiumMMX-200 to do so.

    I think the initial “killer app” for short-vector SIMD was very much
    video encoding/decoding, not audio encoding/decoding. Audio was
    already easy enough to manage with general-purpose CPUs of the 1990s.

    Agreed.  But having SIMD made audio processing more efficient, which
    was a nice bonus - especially if you wanted more than CD quality audio.

    Having SIMD available was a key part of making the open source Ogg
    Vorbis decoder 3x faster.

    It worked on MMX/SSE/SSE2/Altivec.


    Yeah. Audio is fun...


    But MP3 and Vorbis have the odd property of either sounding really good
    (at high bitrates) or terrible (at lower bitrates, particularly if used
    for something with variable playback speed).

    Seems to be a general issue with audio codecs built from a similar sort
    of block-transform approach (such as MDCT or WHT).


    In some of my own experiments in a similar area, I had used WHT, but
    didn't get quite as good results. One problem seems to be that there
    is a big issue with frequencies near the block size, which
    results in nasty artifacts. The overlapping blocks and windowing of MDCT
    reduce this issue, but as noted, MDCT has a high computational cost (vs
    Haar or WHT).

    I have yet to come up with something in this category that gives
    satisfactory results (cheap, simple, effective, and passable quality).


    Can also note: ADPCM works OK.

    Can get better results IMO at bitrates lower than where MP3 or Vorbis
    are effective.

    Near the lower end:
    16kHz 2-bit ADPCM: OK, 32kbps
    11kHz 2-bit ADPCM: meh, 22kbps
    8kHz 4-bit ADPCM: Weak, 32kbps
    8kHz 2-bit ADPCM: poor, 16kbps


    Getting OK results at 2-bits/sample requires a different approach from
    what works well at 4 bits: rather than encoding one sample at a
    time, one usually needs to encode a block of samples at a time and
    then search the entire possibility space. Trying to encode samples one
    at a time gives poor results. This makes 2-bit encoding slower and more
    complicated than 4-bit encoding (but the decoder can still be fast).
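
    For illustration, a toy sketch of that block-search idea (the
    predictor and step-adaptation rule below are made-up stand-ins, not
    the actual codec): with 4 samples per block there are only 4^4 = 256
    candidate codes, so the whole space can be searched for the lowest
    squared error.

      #include <stdint.h>

      typedef struct { int pred; int step; } adpcm_state;

      /* Toy 2-bit decode: the code selects +/-step or +/-3*step, then the
         step size adapts up or down. */
      static int dec1(adpcm_state *s, int code)
      {
          static const int mul[4] = { 1, 3, -1, -3 };
          static const int adj[4] = { -1, 1, -1, 1 };
          s->pred += mul[code] * s->step;
          s->step += adj[code] * (s->step / 4 + 1);
          if (s->step < 1) s->step = 1;
          return s->pred;
      }

      /* Exhaustive search: pick the 4-code block that minimizes squared
         error against the next 4 input samples, then advance the state. */
      static unsigned encode_block4(adpcm_state *s, const int16_t *in)
      {
          unsigned best = 0;
          long long best_err = -1;
          for (unsigned cand = 0; cand < 256; cand++) {
              adpcm_state t = *s;
              long long err = 0;
              for (int i = 0; i < 4; i++) {
                  long long d = (long long)in[i] - dec1(&t, (cand >> (2 * i)) & 3);
                  err += d * d;
              }
              if (best_err < 0 || err < best_err) { best_err = err; best = cand; }
          }
          for (int i = 0; i < 4; i++)
              dec1(s, (best >> (2 * i)) & 3);
          return best;   /* one byte = four 2-bit codes */
      }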

    As noted, ADPCM proper does not work below 2 bits/sample.

    The added accuracy of 4-bit samples is not an advantage in this case
    since the reduction in sample rate has a more obvious negative impact here.


    After trying a few experiments, the current front-runner for going lower is: Encode a group of 8 or 16 samples as an 8-bit index into a table of
    patterns (such as groups of 2-bit ADPCM samples);
    This can achieve 1.0 or 0.5 bits/sample.

    Have yet to get anything with particularly acceptable audio quality though.

    Did end up resorting to using genetic algorithms for building the
    pattern tables for these experiments. I did previously experiment with
    an interpolation pattern table, but this gave worse results.


    One other line of experimentation was trying to fudge the ADPCM encoding algorithm to preferentially try to generate repeating patterns over
    novel ones with the aim of making it more compressible with LZ77.

    However, it was difficult to significantly improve LZ compressibility
    while still maintaining some semblance of audio quality. Neither
    byte-oriented LZ (e.g., LZ4) nor Deflate was particularly effective.


    Did note however that both LZMA and an LZMA style bitwise range encoder
    were much more effective (particularly with 12 or 16 bits of context).

    However, a range encoder is near the upper end of computational
    feasibility (and using a range encoder to squeeze bits out of ADPCM
    seems kinda absurd).


    One intermediate option seems to be a permutation transform. This can
    make the data more amenable to STF+AdRice or Huffman.

    Say, a 2-bit permutation transform is possible (though, in this case one
    can represent every permutation as a 5-bit finite state machine, stored
    as bytes in RAM for convenience). This does have the nice property that
    one can use an 8-bit table lookup for each context, which then produces 2
    bits of output at a time.

    Say:
    hist: 8 bits of history
    ival: input, 4x 2-bits
    oval: output, 4x 2-bits, permuted

    px1=permstate[hist];
    ix=((ival>>0)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=px2&3;

    px1=permstate[hist];
    ix=((ival>>2)&0x03);
    px2=permupdtab[(px1&0xFC)|ix];
    permstate[hist]=px2;
    hist=(hist<<2)|ix;
    oval=oval|((px2&3)<<2);
    ...

    Decoding process is similar

    One downside of this is that they are still about as slow as using the
    bitwise range-coder would have been.


    Also, still doesn't really allow breaking into sub 10 kbps territory
    without a loss of quality. The use of pattern tables allows breaking
    into this territory with a similar loss of quality, and at a lower computational cost.

    Though, it seems possible that the permutation transform could be
    directly integrated with the ADPCM decoder (in effect turning it into
    more of a predictive transform); still wouldn't do much for speed, but
    alas. Would also still need an entropy coder to make use of this.



    One other route seems to be sinewave synthesis, say:
    Pick the top 4 sine waves via some strategy;
    Encode the frequency and amplitude (needs ~ 16 bits IME);
    Do this ~ 100-128 times per second.
    100Hz seems to be a lower limit for intelligibility.

    This needs ~ 6.4 to 8.2 kbps, or 7.2 to 9.2 kbps if one also includes a
    byte to encode a white noise intensity.
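
    For illustration, a minimal sketch of the decode/resynthesis side
    (the names, frame layout, and fixed 4 partials per frame are
    assumptions for the example, not the actual encoder described here):

      #include <math.h>
      #include <stddef.h>

      #ifndef M_PI
      #define M_PI 3.14159265358979323846
      #endif

      #define PARTIALS 4

      typedef struct {
          float freq[PARTIALS];   /* Hz */
          float amp[PARTIALS];    /* linear amplitude */
      } sw_frame;

      /* Render one frame (e.g. 16000/100 = 160 samples at a 100 Hz update
         rate) by summing the partials; 'phase' carries phase across frames
         so the waves stay continuous. */
      void sw_render_frame(const sw_frame *f, float *out, size_t nsamp,
                           float sample_rate, float phase[PARTIALS])
      {
          for (size_t i = 0; i < nsamp; i++)
              out[i] = 0.0f;
          for (int p = 0; p < PARTIALS; p++) {
              float step = 2.0f * (float)M_PI * f->freq[p] / sample_rate;
              for (size_t i = 0; i < nsamp; i++) {
                  out[i] += f->amp[p] * sinf(phase[p]);
                  phase[p] += step;
                  if (phase[p] > 2.0f * (float)M_PI)
                      phase[p] -= 2.0f * (float)M_PI;
              }
          }
      }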

    I had best results by taking the space from 2 to 8 kHz, dividing it
    into ~1/3 octaves, picking the strongest wave from each group, and then
    picking the top 4 strongest waves. It worked better for me to ignore
    lower frequencies (low frequencies seem to contain a lot of louder
    wave-forms which contribute little to intelligibility). In this case,
    waves between 2 and 4 kHz tend to dominate.

    Works OK for speech, but is poor for non-speech audio.
    Quality can be improved by more waves, but this quickly eats any bitrate advantage.
    Can note that while called sinewave synthesis, I also got good results
    with 3-state waves (-1, 0, 1), which are computationally preferable (wave-shape is: 1,0,-1,0).

    Can note that when used for non-speech, sinewave synthesis can have
    similar artifacts to low bitrate MP3.

    Could be pushed to lower update rates and maybe could make sense for
    basic songs (say, as a possible alternative to MIDI; which is arguably a somewhat more complex technology).

    Though, can note that for some older systems, sound effects were stored
    as variable-frequency square waves (say, for example, updating the
    square-wave frequency at 18 Hz or similar, with each frequency stored as
    a 16-bit clock-divider value or similar); along with some use of
    Delta-Sigma audio (where low-frequency delta-sigma sounds terrible).
    Neither are particularly good though.


    Though, for general audio storage (such as sound effects), some sort of
    ADPCM variant still seems preferable here.

    Though, I have still not found anything that is clearly beating 2-bit
    ADPCM for this (seemingly still a good option for sound effects).

    And, as noted, could still get good results with ADPCM + LZMA (or
    similar), main issue being the high computational cost of the latter.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Fri Oct 24 04:10:03 2025
    From Newsgroup: comp.arch

    Michael S <[email protected]> wrote:
    On Fri, 17 Oct 2025 20:54:23 GMT
    MitchAlsup <[email protected]d> wrote:

    George Neuner <[email protected]> posted:



    Hope the attributions are correct.


    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
    <[email protected]d> wrote:


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <[email protected]d> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    :
    In any case, even with these languages there are still
    software projects that fail, miss their deadlines and have
    overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out
    how to make the (17 kinds of) hammers one needs, there is little
    need to make a new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have
    been happier... The mouse was more precise in W7 than in W8 ...
    With a little upgrade for new PCIe architecture along the way,
    rather than redesigning the whole kit and caboodle for tablets and
    phones, which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998,
    ... and few people would have cared. Many SW projects are driven
    not by demand for the product, but pushed by companies to make
    already satisfied users have to upgrade.

    Those programmers could have transitioned to new SW projects
    rather than redesigning the same old thing 8 more times. Presto,
    there are now enough well-trained SW engineers to tackle the undone
    SW backlog.

    The problem is that decades of "New & Improved" consumer products
    have conditioned the public to expect innovation (at minimum new
    packaging and/or advertising) every so often.

    Bringing it back to computers: consider that a FOSS library which
    hasn't seen an update for 2 years likely would be passed over by
    many current developers due to concern that the project has been
    abandoned. That perception likely would not change even if the
    author(s) responded to inquiries, the library was suitable "as is"
    for the intended use, and the lack of recent updates can be
    explained entirely by a lack of new bug reports.

    LAPACK has not been updated in decades, yet is as relevant today as
    the first day it was available.


    It is possible that the LAPACK API was not updated in decades, although
    I'd expect that even at the API level there were at least small
    additions, if not changes. But if you are right that the LAPACK
    implementation was not updated in decades, then you can be sure that it
    is either not used by anybody or used by very few people.

    AFAICS at the logical level the interface stays the same. There is one
    significant change: in the old days you were on your own trying to
    interface to Lapack from C; now you can get a C interface.
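
    For example (a minimal sketch, assuming the LAPACKE interface from
    lapacke.h and linking with -llapacke -llapack; the matrix values are
    just for illustration):

        /* Solve A x = b for a 2x2 system via the C interface (dgesv). */
        #include <stdio.h>
        #include <lapacke.h>

        int main(void)
        {
            double a[4] = { 4.0, 1.0,     /* row-major 2x2 matrix A */
                            1.0, 3.0 };
            double b[2] = { 1.0, 2.0 };   /* RHS, overwritten with the solution */
            lapack_int ipiv[2];

            lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                            a, 2, ipiv, b, 1);
            if (info != 0) {
                fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
                return 1;
            }
            printf("x = [%g, %g]\n", b[0], b[1]);
            return 0;
        }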

    Concerning the implementation, AFAICS there are changes: some
    improvements to accuracy, some to speed. But the bulk of the code
    stays the same. There is a lot of work on the lower layer, that
    is, BLAS. But the idea of Lapack was that the higher-level algorithms
    are portable (also in time), while the lower-level building blocks
    must be adapted to the changing computing environment.

    There were attempts to replace Lapack by C++ templates; I do not
    see this gaining traction. There were attempts to extend Lapack
    to a larger class of matrices (mostly sparse matrices); apparently
    this is less popular than Lapack.

    There are attempts to automatically convert a simple high-level
    description of the operations into high-performance code. IIUC
    this has had some success with FFT and a few similar things, but
    is currently unable to replace Lapack.

    I would say the following: if you have a good algorithm, that
    algorithm may live long. Sometimes better things are invented
    later, but if not, then the old algorithm may be used for quite a
    long time. The goal of algorithmic languages was to make portable
    implementations of algorithms. That works reasonably well, but if
    one aims at the highest possible speed, then the needed tweaks are
    frequently machine-specific, so good performance may be nonportable.
    In the case of Lapack, it seems that there are no better algorithms
    now compared to the time when Lapack was created. The performance of
    Lapack on larger matrices depends mostly on the performance of
    BLAS, so there is a lot of current work on BLAS. IIUC Lapack
    routines are sometimes replaced by better-performing versions,
    but most of the time the gain is too small to justify the effort.

    Concerning "being used by few people": there are codes which
    are sold to a lot of users were performance or features
    matter a lot, such codes tend to evolve quickly. More
    typical is growth by adding new parts: old parts are kept
    with small changes, but new things are build on it (and
    new things independent of old thing are added). There is
    also popular "copy and mutate" approach: some parts are
    copied and them modified to provide different function
    (examples of this are drivers in an OS or new frontends
    in a compiler). However, this is partially weakness of
    programming language (it would be nicer to have clearly
    specified common part and concise specification of
    differences needed for various cases). Partly this is
    messy nature of real world. Lapack is a happly case
    when problem was quite well specified and language
    was reasonable fit for the problem. They use textual
    substitution to produce real and complex variants
    for single and double precision, so in principle
    language could do more. And certainly one could wish
    nicer and more compact description of the algorithms.
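
    (As a rough analogy in C -- Lapack does this in its Fortran sources,
    not with the C preprocessor, and the names below are made up -- one
    routine body can be stamped out per element type:)

        /* One routine body, expanded for single and double precision. */
        #define DEFINE_AXPY(NAME, T)                          \
            void NAME(int n, T alpha, const T *x, T *y)       \
            {                                                 \
                for (int i = 0; i < n; i++)                   \
                    y[i] += alpha * x[i];                     \
            }

        DEFINE_AXPY(saxpy_like, float)    /* single-precision variant */
        DEFINE_AXPY(daxpy_like, double)   /* double-precision variant */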
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Fri Oct 24 05:56:08 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <[email protected]> schrieb:

    AFAICS at the logical level the interface stays the same. There is one
    significant change: in the old days you were on your own trying to
    interface to Lapack from C; now you can get a C interface.

    And they got that wrong (by which I was personally bitten).
    See https://lwn.net/Articles/791393/ for a good write-up.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2