On 2026-01-28 6:43 p.m., BGB wrote:
On 1/28/2026 5:03 PM, MitchAlsup wrote:
BGB <[email protected]> posted:
On 1/28/2026 7:25 AM, Kent Dickey wrote:
Sort of reminds me of one case where I evaluated the possibility of a
64-bit hardware multiplier which would internally decompose it into
32x32->64 bit widening multiplies and add the parts back together.
Then noted the drawback that this wouldn't have been much faster than
doing it in software (using the same general strategy). Eventually did end up adding a (significantly slower, but cheaper) shift-and-add
hardware multiplier.
Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
FP32 was 4 cycles, FP64 was 7 cycles.
When you wanted 32×32->64 there was a 12-cycle instruction sequence
that would provide it--and yes, it required extracting 16-bit partials,
multiplying 4 of them, and adding them all up.
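The composition described above can be sketched in C. This is an illustrative sketch only, not the actual 88100 instruction sequence: a 32x32->64 widening multiply built from four 16x16->32 partial products summed at their bit offsets.

```c
#include <stdint.h>

/* Sketch: 32x32->64 widening multiply decomposed into four 16x16
 * partial products, as the 88100-style instruction sequence would.
 * (Illustrative only; not the actual hardware/instruction sequence.) */
uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint64_t ll = (uint64_t)al * bl;   /* low  x low,  offset  0 */
    uint64_t lh = (uint64_t)al * bh;   /* low  x high, offset 16 */
    uint64_t hl = (uint64_t)ah * bl;   /* high x low,  offset 16 */
    uint64_t hh = (uint64_t)ah * bh;   /* high x high, offset 32 */

    /* Sum the partials at their respective bit offsets. */
    return ll + (lh << 16) + (hl << 16) + (hh << 32);
}
```

The same scheme scales up one level: a 64x64->128 multiply decomposes into four 32x32->64 partial products in exactly the same way.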
Similar here:
32*32=>64: 3-cycle, pipelined;
Considered hard-wired logic mechanism: ~ 12 cycles;
Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
Shift-and-add: 68 cycles (same as DIV/REM).
But, easier to justify the LUTs in the name of RV 'M' support.
Still faster than trap and emulate.
64-bit integer MUL and DIV are not quite rare enough for trap and emulate to be acceptable from a performance POV. The slow hardware integer divide did manage to outperform a software shift-and-subtract loop though (so it had that much going for it, at least).
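The software shift-and-subtract loop referred to above can be sketched as a generic restoring divider in C (not the poster's actual routine). It produces one quotient bit per iteration, so a 64-bit divide takes 64 trips through the loop, which is why the slow hardware divider still wins:

```c
#include <stdint.h>

/* Sketch of a software shift-and-subtract (restoring) divide loop.
 * One quotient bit per iteration; 64 iterations for a 64-bit divide.
 * Assumes d != 0 (a real routine would trap or branch on that case). */
uint64_t udiv64_shift_sub(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down next dividend bit */
        if (r >= d) {                   /* trial subtract succeeds: */
            r -= d;                     /*   keep the difference     */
            q |= (1ULL << i);           /*   and set the quotient bit */
        }
    }
    if (rem) *rem = r;
    return q;
}
```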
For Binary64, this unit is around 112 cycles for FDIV (due to quirks).
In the past, hardware Newton-Raphson was an option, but it is more complicated and expensive to make work well.
The FMUL is a fair bit faster, and this means software Newton-Raphson is still the most attractive option from the performance POV.
If done for Binary128, it would be around 228 cycles for FMUL and FDIV, assuming the shift-and-add unit remains at 1 bit per cycle.
There is concern that internal latency could force it down to 0.5 bit/cycle, which would be 456 cycles.
If it were 456 cycles, may as well just use trap-and-emulate at that point...
In the latter case, just using the 32-bit widening integer multiplier to implement the Binary128 FMUL and using Newton-Raphson is likely to be faster.
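The software Newton-Raphson scheme favored above can be sketched in C. This is a minimal illustration under assumed details (a linear seed and plain double arithmetic), not the poster's actual code: a real Binary64/Binary128 routine would seed from a table and do the refinement with the widening integer multiplier. Each iteration x' = x*(2 - d*x) roughly doubles the number of correct bits, so a handful of FMUL-class operations replaces the long iterative divide:

```c
#include <math.h>

/* Sketch: divide via Newton-Raphson reciprocal refinement.
 * Assumes d is finite, nonzero, and positive; uses doubles for
 * brevity where a real routine would use fixed-point/integer ops. */
double nr_divide(double n, double d)
{
    int e;
    double m = frexp(d, &e);                       /* d = m * 2^e, m in [0.5, 1) */
    double x = (48.0 / 17.0) - (32.0 / 17.0) * m;  /* linear seed, ~4 good bits */
    x = ldexp(x, -e);                              /* undo scaling: x ~= 1/d */

    /* Four iterations: error squares each pass, reaching double precision. */
    for (int i = 0; i < 4; i++)
        x = x * (2.0 - d * x);

    return n * x;
}
```

Note the usual caveat: the result can differ from a correctly rounded divide by an ulp or so in the last place, which is part of why making hardware Newton-Raphson IEEE-correct is the expensive part.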
The main merit of Binary128, though, is that "long double" is so
infrequently used that it almost doesn't matter if it is glacially slow (even more so for FDIV, which for many programs might not happen at all).
...
I seem to find that it is difficult to get better performance for FDIV
than using a simple divider.
FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So performing three or four iterations of NR in software (60 to 80 clocks)
is just about as time consuming as using a divider.
For FDIV (or FMUL) with a radix-2 divide it can probably operate at
double the CPU clock frequency. For instance the FDIV in my float
package runs at almost 300 MHz. But the CPU can only be clocked about
100 MHz. So a double-frequency clock is used for FDIV. This cuts the relative latency in half (60 CPU clocks).
I could maybe balance the timing in the FMA better to reduce the latency somewhat and still keep the same FMAX. The 64x64 multiply by itself has about 11 cycles of latency, built up out of 16x16 multipliers.
On Wed, 21 Jan 2026 01:44:08 GMT
MitchAlsup <[email protected]> wrote:
Anyone still here and active ???
https://www.linfo.org/rule_of_silence.html
I have serious doubts about the universal wisdom of this rule in the
field of human-machine interfaces, but for Usenet interaction it is
golden.