• Re: floating point history, word order and byte order

    From MitchAlsup@[email protected] to comp.arch on Thu Jan 29 21:30:49 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-01-28 6:43 p.m., BGB wrote:
    On 1/28/2026 5:03 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 1/28/2026 7:25 AM, Kent Dickey wrote:


    Sort of reminds me of one case where I evaluated the possibility of a
    64-bit hardware multiplier which would internally decompose it into
    32x32->64 bit widening multiplies and add the parts back together.

    Then noted the drawback that this wouldn't have been much faster than
    doing it in software (using the same general strategy). Eventually did
    end up adding a (significantly slower, but cheaper) shift-and-add
    hardware multiplier.

    Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
    FP32 was 4 cycles, FP64 was 7 cycles.

    When you wanted 32×32->64 there was a 12-cycle instruction sequence
    that would provide it--and yes, it required extracting 16-bit partials,
    multiplying 4 of them, and adding them all up.
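
    The sequence described above can be sketched as follows; this is an
    illustrative model (not actual MC88100 code, and the names are made up)
    of building a full 32×32->64 product from four 16×16 partials:

```python
def mul32x32_to_64(a: int, b: int) -> int:
    """Build a 64-bit product from four 16x16->32 partial products."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    # Four 16x16 partial products, each fitting in 32 bits.
    p0 = a_lo * b_lo          # weight 2^0
    p1 = a_lo * b_hi          # weight 2^16
    p2 = a_hi * b_lo          # weight 2^16
    p3 = a_hi * b_hi          # weight 2^32
    # Sum the partials at their respective weights.
    return (p3 << 32) + ((p1 + p2) << 16) + p0
```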


    Similar here:
      32*32=>64: 3-cycle, pipelined;
      Considered hard-wired logic mechanism:
        ~ 12 cycles;
      Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
      Shift-and-add: 68 cycles (same as DIV/REM).
        But, easier to justify the LUTs in the name of RV 'M' support.
        Still faster than trap and emulate.
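
    The shift-and-add option in the list above retires one multiplier bit
    per cycle; a rough behavioral model (names illustrative, not the actual
    implementation) looks like:

```python
def shift_add_mul64(a: int, b: int) -> int:
    """Model of a 1-bit-per-cycle shift-and-add 64x64 multiplier.

    The ~68-cycle figure corresponds to one loop iteration per
    multiplier bit plus some fixed overhead.
    """
    acc = 0
    for i in range(64):           # one "cycle" per multiplier bit
        if (b >> i) & 1:
            acc += a << i         # conditionally add the shifted multiplicand
    return acc & ((1 << 128) - 1) # 128-bit result register
```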

    64-bit integer MUL and DIV are not quite rare enough for trap and
    emulate to be acceptable from a performance POV. The slow hardware
    integer divide did manage to outperform a software shift-and-subtract
    loop though (so it had that much going for it at least).


    For Binary64, this unit is around 112 cycles for FDIV (due to quirks).

    In the past, hardware Newton-Raphson was an option, but it is more complicated and expensive to make work well.

    The FMUL is a fair bit faster, and this means software Newton-Raphson is still the most attractive option from the performance POV.
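
    As a rough sketch of the software Newton-Raphson approach: seed a
    reciprocal estimate, then iterate r = r*(2 - d*r), which roughly
    doubles the number of correct bits each time. The seed polynomial and
    iteration count below are the textbook choices, not necessarily what
    is used here:

```python
import math

def nr_divide(n: float, d: float) -> float:
    """Newton-Raphson reciprocal divide, sketched for positive finite d."""
    m, e = math.frexp(d)             # d = m * 2**e, with m in [0.5, 1)
    r = 48/17 - (32/17) * m          # classic linear seed, ~4 good bits
    for _ in range(4):               # each iteration ~doubles the good bits
        r = r * (2.0 - m * r)        # converges quadratically toward 1/m
    return math.ldexp(n * r, -e)     # n/d = n * (1/m) * 2**-e
```

    Four iterations take the ~4-bit seed past the 53 bits of Binary64,
    which is why the FMUL latency dominates the cost.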




    If done for Binary128, it would be around 228 cycles for FMUL and FDIV, assuming the shift-and-add unit remains at 1 bit per cycle.
    There is concern that internal latency could require 0.5 bits/cycle, which would be 456 cycles.

    If it were 456 cycles, may as well just use trap-and-emulate at that point...

    In the latter case, just using the 32-bit widening integer multiplier to implement the Binary128 FMUL and using Newton-Raphson is likely to be faster.

    The main merit of Binary128, though, is that "long double" is so
    infrequently used that it almost doesn't matter if it is glacially slow (even more so with FDIV, which for many programs might not happen at all).

    ...



    I seem to find that it is difficult to get better performance for FDIV
    than using a simple divider.

    FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So performing three or four iterations of NR in software (60 to 80 clocks)
    is just about as time consuming as using a divider.

    For FDIV (or FMUL) with a radix-2 divide it can probably operate at
    double the CPU clock frequency. For instance the FDIV in my float
    package runs at almost 300 MHz. But the CPU can only be clocked about
    100 MHz. So a double-frequency clock is used for FDIV. This cuts the relative latency in half (60 CPU clocks).

    An SRT step (iteration) can be done several times per cycle,
    3 steps per 16 gate cycle is not that hard.
    4 steps per 16 gate cycle is on the edge of doable.
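
    SRT proper uses a redundant digit set {-1, 0, +1} chosen from just a
    few top bits of a carry-save remainder, which is what makes several
    steps per cycle feasible. As a simpler stand-in, here is the closely
    related non-restoring digit recurrence, one quotient digit per step
    (illustrative model, not any particular hardware):

```python
def nonrestoring_divide(n: int, d: int, bits: int = 16):
    """Digit-recurrence division: one quotient digit in {-1, +1} per step.

    Requires 0 <= n < d and d > 0. Returns (q, r) such that
    n * 2**bits == q * d + r, with 0 <= r < d.
    """
    q, r = 0, n
    for _ in range(bits):
        if r >= 0:
            r = (r << 1) - d       # digit +1
            q = (q << 1) + 1
        else:
            r = (r << 1) + d       # digit -1
            q = (q << 1) - 1
    if r < 0:                      # final fixup to a nonnegative remainder
        r += d
        q -= 1
    return q, r
```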

    64-bit div is thus on the order of 23 cycles (64/3≈21, plus 2 pipeline)
    whereas a Goldschmidt with NR correction is 17 cycles IEEE correct
    where one knows they are within 1 ULP at cycle 12.
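
    The Goldschmidt iteration mentioned above multiplies numerator and
    denominator by the same correction factor each step, driving the
    denominator toward 1. The table-lookup seed that lets real hardware
    hit 17 cycles is omitted in this sketch, which simply runs more
    iterations instead (assumptions: d pre-scaled, no final NR/rounding
    correction):

```python
def goldschmidt_divide(n: float, d: float, steps: int = 6) -> float:
    """Goldschmidt division sketch; assumes d is pre-scaled into [0.5, 1)."""
    x, y = n, d
    for _ in range(steps):
        f = 2.0 - y        # correction factor
        x *= f             # numerator converges toward n/d
        y *= f             # denominator converges quadratically toward 1.0
    return x
```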

    I could maybe better balance the timing in the FMA to reduce the latency somewhat and still keep the same FMAX. The 64x64 multiply, built up out of 16x16 multipliers, has by itself about 11 cycles of latency.

    I suspect they are making you eat the 32-bit adder from each 16×16
    instead of doing everything in carry-save format until the final add.

    A 64×32 Booth recoded Dadda/Wallace tree is only 5 layers of 4-2
    compressors {or 10-gates of delay (after recoder fanout)} plus a
    128-bit adder (of your choice) gate delay (say 11-gates of delay);
    for a total multiply time of 21 gates or 1.5 cycles.

    Add the FP multiplexers, Booth recoding, find first for normalization,
    and you are sitting at 3.3 cycles PLUS wire delay.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Jan 29 17:44:20 2026
    From Newsgroup: comp.arch

    On 1/29/2026 2:47 AM, Robert Finch wrote:
    On 2026-01-28 6:43 p.m., BGB wrote:
    On 1/28/2026 5:03 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 1/28/2026 7:25 AM, Kent Dickey wrote:


    Sort of reminds me of one case where I evaluated the possibility of a
    64-bit hardware multiplier which would internally decompose it into
    32x32->64 bit widening multiplies and add the parts back together.

    Then noted the drawback that this wouldn't have been much faster than
    doing it in software (using the same general strategy). Eventually did
    end up adding a (significantly slower, but cheaper) shift-and-add
    hardware multiplier.

    Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
    FP32 was 4 cycles, FP64 was 7 cycles.

    When you wanted 32×32->64 there was a 12-cycle instruction sequence
    that would provide it--and yes, it required extracting 16-bit partials,
    multiplying 4 of them, and adding them all up.


    Similar here:
       32*32=>64: 3-cycle, pipelined;
       Considered hard-wired logic mechanism:
         ~ 12 cycles;
       Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
       Shift-and-add: 68 cycles (same as DIV/REM).
         But, easier to justify the LUTs in the name of RV 'M' support.
         Still faster than trap and emulate.

    64-bit integer MUL and DIV are not quite rare enough for trap
    and emulate to be acceptable from a performance POV. The slow hardware
    integer divide did manage to outperform a software shift-and-subtract
    loop though (so it had that much going for it at least).


    For Binary64, this unit is around 112 cycles for FDIV (due to quirks).

    In the past, hardware Newton-Raphson was an option, but it is more
    complicated and expensive to make work well.

    The FMUL is a fair bit faster, and this means software Newton-Raphson
    is still the most attractive option from the performance POV.




    If done for Binary128, it would be around 228 cycles for FMUL and FDIV,
    assuming the shift-and-add unit remains at 1 bit per cycle.
    There is concern that internal latency could require 0.5 bits/cycle,
    which would be 456 cycles.

    If it were 456 cycles, may as well just use trap-and-emulate at that
    point...

    In the latter case, just using the 32-bit widening integer multiplier
    to implement the Binary128 FMUL and using Newton-Raphson is likely to
    be faster.

    The main merit of Binary128, though, is that "long double" is so
    infrequently used that it almost doesn't matter if it is glacially
    slow (even more so with FDIV, which for many programs might not happen
    at all).

    ...



    I seem to find that it is difficult to get better performance for FDIV
    than using a simple divider.

    FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So performing three or four iterations of NR in software (60 to 80 clocks)
    is just about as time consuming as using a divider.

    For FDIV (or FMUL) with a radix-2 divide it can probably operate at
    double the CPU clock frequency. For instance the FDIV in my float
    package runs at almost 300 MHz. But the CPU can only be clocked about
    100 MHz. So a double-frequency clock is used for FDIV. This cuts the relative latency in half (60 CPU clocks).

    I could maybe better balance the timing in the FMA to reduce the latency somewhat and still keep the same FMAX. The 64x64 multiply, built up out of 16x16 multipliers, has by itself about 11 cycles of latency.


    OK, I have:
    Binary64 FMUL: 6 cycles
    Binary64 FADD: 6 cycles (incl FSUB, Int<->FP)
    Via SIMD Unit:
    Binary32 FMUL: 3 cycles (incl SIMD)
    Binary32 FADD: 3 cycles (incl SIMD)
    FMULA/FADDA: Also 3 cycles (Binary64 format at Binary32 precision).

    This mostly leaves N-R as the fastest strategy in this case.

    No FMA as there isn't really a good way to get the latency low enough
    except in a very niche case of FP8*FP8+FP16, but this would likely only
    really be useful for NN's or similar (not as useful as a general purpose
    SIMD instruction).

    Granted, FP8 for inputs/weights and FP16 accumulators does seem to be a
    fairly effective approach for NN's.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.arch on Sat Feb 14 20:49:05 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:

    On Wed, 21 Jan 2026 01:44:08 GMT
    MitchAlsup <[email protected]d> wrote:

    Anyone still here and active ???

    https://www.linfo.org/rule_of_silence.html

    I have serious doubts about the universal wisdom of this rule in the
    field of human-machine interfaces, but for Usenet interaction it is
    golden.

    Thank you for this.
    --- Synchronet 3.21b-Linux NewsLink 1.2