• Re: floating point history, word order and byte order

    From MitchAlsup@[email protected] to comp.arch on Thu Jan 29 21:30:49 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    On 2026-01-28 6:43 p.m., BGB wrote:
    On 1/28/2026 5:03 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 1/28/2026 7:25 AM, Kent Dickey wrote:


    Sort of reminds me of one case where I evaluated the possibility of a
    64-bit hardware multiplier which would internally decompose it into
    32x32->64 bit widening multiplies and add the parts back together.

    Then noted the drawback that this wouldn't have been much faster than
    doing it in software (using the same general strategy). Eventually did
    end up adding a (significantly slower, but cheaper) shift-and-add
    hardware multiplier.

    Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
    FP32 was 4 cycles, FP64 was 7 cycles.

    When you wanted 32×32->64 there was a 12-cycle instruction sequence
    that would provide it--and yes, it required extracting 16-bit partials,
    multiplying 4 of them, and adding them all up.
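
    The sequence described above can be sketched as follows; this is an
    illustrative model (not actual MC88100 code, and the names are made up)
    of building a full 32×32->64 product from four 16×16 partials:

```python
def mul32x32_to_64(a: int, b: int) -> int:
    """Build a 64-bit product from four 16x16->32 partial products."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    # Four 16x16 partial products, each fitting in 32 bits.
    p0 = a_lo * b_lo          # weight 2^0
    p1 = a_lo * b_hi          # weight 2^16
    p2 = a_hi * b_lo          # weight 2^16
    p3 = a_hi * b_hi          # weight 2^32
    # Sum the partials at their respective weights.
    return (p3 << 32) + ((p1 + p2) << 16) + p0
```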


    Similar here:
      32*32=>64: 3-cycle, pipelined;
      Considered hard-wired logic mechanism:
        ~ 12 cycles;
      Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
      Shift-and-add: 68 cycles (same as DIV/REM).
        But, easier to justify the LUTs in the name of RV 'M' support.
        Still faster than trap and emulate.
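
    The shift-and-add option in the list above retires one multiplier bit
    per cycle; a rough behavioral model (names illustrative, not the actual
    implementation) looks like:

```python
def shift_add_mul64(a: int, b: int) -> int:
    """Model of a 1-bit-per-cycle shift-and-add 64x64 multiplier.

    The ~68-cycle figure corresponds to one loop iteration per
    multiplier bit plus some fixed overhead.
    """
    acc = 0
    for i in range(64):           # one "cycle" per multiplier bit
        if (b >> i) & 1:
            acc += a << i         # conditionally add the shifted multiplicand
    return acc & ((1 << 128) - 1) # 128-bit result register
```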

    64-bit integer MUL and DIV are not quite rare enough for trap and
    emulate to be acceptable from a performance POV. The slow hardware
    integer divide did manage to outperform a software shift-and-subtract
    loop though (so it had that much going for it at least).


    For Binary64, this unit is around 112 cycles for FDIV (due to quirks).

    In the past, hardware Newton-Raphson was an option, but it is more complicated and expensive to make work well.

    The FMUL is a fair bit faster, and this means software Newton-Raphson is still the most attractive option from the performance POV.
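
    As a rough sketch of the software Newton-Raphson approach: seed a
    reciprocal estimate, then iterate r = r*(2 - d*r), which roughly
    doubles the number of correct bits each time. The seed polynomial and
    iteration count below are the textbook choices, not necessarily what
    is used here:

```python
import math

def nr_divide(n: float, d: float) -> float:
    """Newton-Raphson reciprocal divide, sketched for positive finite d."""
    m, e = math.frexp(d)             # d = m * 2**e, with m in [0.5, 1)
    r = 48/17 - (32/17) * m          # classic linear seed, ~4 good bits
    for _ in range(4):               # each iteration ~doubles the good bits
        r = r * (2.0 - m * r)        # converges quadratically toward 1/m
    return math.ldexp(n * r, -e)     # n/d = n * (1/m) * 2**-e
```

    Four iterations take the ~4-bit seed past the 53 bits of Binary64,
    which is why the FMUL latency dominates the cost.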




    If done for Binary128, it would be around 228 cycles for FMUL and FDIV, assuming the shift-and-add unit remains at 1 bit per cycle.
    There is concern that internal latency could require 0.5 bits/cycle, which would be 456 cycles.

    If it were 456 cycles, may as well just use trap-and-emulate at that point...

    In the latter case, just using the 32-bit widening integer multiplier to implement the Binary128 FMUL and using Newton-Raphson is likely to be faster.

    The main merit of Binary128, though, is that "long double" is so
    infrequently used that it almost doesn't matter if it is glacially slow (even more so with FDIV, which for many programs might not happen at all).

    ...



    I seem to find that it is difficult to get better performance for FDIV
    than using a simple divider.

    FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So performing three or four iterations of NR in software (60 to 80 clocks)
    is just about as time consuming as using a divider.

    For FDIV (or FMUL) with a radix-2 divide it can probably operate at
    double the CPU clock frequency. For instance the FDIV in my float
    package runs at almost 300 MHz. But the CPU can only be clocked about
    100 MHz. So a double-frequency clock is used for FDIV. This cuts the relative latency in half (60 CPU clocks).

    An SRT step (iteration) can be done several times per cycle,
    3 steps per 16 gate cycle is not that hard.
    4 steps per 16 gate cycle is on the edge of doable.
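
    SRT proper uses a redundant digit set {-1, 0, +1} chosen from just a
    few top bits of a carry-save remainder, which is what makes several
    steps per cycle feasible. As a simpler stand-in, here is the closely
    related non-restoring digit recurrence, one quotient digit per step
    (illustrative model, not any particular hardware):

```python
def nonrestoring_divide(n: int, d: int, bits: int = 16):
    """Digit-recurrence division: one quotient digit in {-1, +1} per step.

    Requires 0 <= n < d and d > 0. Returns (q, r) such that
    n * 2**bits == q * d + r, with 0 <= r < d.
    """
    q, r = 0, n
    for _ in range(bits):
        if r >= 0:
            r = (r << 1) - d       # digit +1
            q = (q << 1) + 1
        else:
            r = (r << 1) + d       # digit -1
            q = (q << 1) - 1
    if r < 0:                      # final fixup to a nonnegative remainder
        r += d
        q -= 1
    return q, r
```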

    64-bit div is thus on the order of 23 cycles (64/3≈21, plus 2 pipeline)
    whereas a Goldschmidt with NR correction is 17 cycles IEEE correct
    where one knows they are within 1 ULP at cycle 12.
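
    The Goldschmidt iteration mentioned above multiplies numerator and
    denominator by the same correction factor each step, driving the
    denominator toward 1. The table-lookup seed that lets real hardware
    hit 17 cycles is omitted in this sketch, which simply runs more
    iterations instead (assumptions: d pre-scaled, no final NR/rounding
    correction):

```python
def goldschmidt_divide(n: float, d: float, steps: int = 6) -> float:
    """Goldschmidt division sketch; assumes d is pre-scaled into [0.5, 1)."""
    x, y = n, d
    for _ in range(steps):
        f = 2.0 - y        # correction factor
        x *= f             # numerator converges toward n/d
        y *= f             # denominator converges quadratically toward 1.0
    return x
```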

    I could maybe better balance the timing in the FMA to reduce the latency somewhat and still keep the same FMAX. The 64x64 multiply, built up out of 16x16 multipliers, has by itself about 11 cycles of latency.

    I suspect they are making you eat the 32-bit adder from each 16×16
    instead of doing everything in carry-save format until the final add.

    A 64×32 Booth recoded Dadda/Wallace tree is only 5 layers of 4-2
    compressors {or 10-gates of delay (after recoder fanout)} plus a
    128-bit adder (of your choice) gate delay (say 11-gates of delay);
    for a total multiply time of 21 gates or 1.5 cycles.

    Add the FP multiplexers, Booth recoding, find first for normalization,
    and you are sitting at 3.3 cycles PLUS wire delay.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Jan 29 17:44:20 2026
    From Newsgroup: comp.arch

    On 1/29/2026 2:47 AM, Robert Finch wrote:
    On 2026-01-28 6:43 p.m., BGB wrote:
    On 1/28/2026 5:03 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 1/28/2026 7:25 AM, Kent Dickey wrote:


    Sort of reminds me of one case where I evaluated the possibility of a
    64-bit hardware multiplier which would internally decompose it into
    32x32->64 bit widening multiplies and add the parts back together.

    Then noted the drawback that this wouldn't have been much faster than
    doing it in software (using the same general strategy). Eventually did
    end up adding a (significantly slower, but cheaper) shift-and-add
    hardware multiplier.

    Mc 88100 uses a 32×32 multiplier:: integer multiply was 3 cycles,
    FP32 was 4 cycles, FP64 was 7 cycles.

    When you wanted 32×32->64 there was a 12-cycle instruction sequence
    that would provide it--and yes, it required extracting 16-bit partials,
    multiplying 4 of them, and adding them all up.


    Similar here:
       32*32=>64: 3-cycle, pipelined;
       Considered hard-wired logic mechanism:
         ~ 12 cycles;
       Runtime call: ~ 16 cycles (maybe 20 with call/return overheads).
       Shift-and-add: 68 cycles (same as DIV/REM).
         But, easier to justify the LUTs in the name of RV 'M' support.
         Still faster than trap and emulate.

    64-bit integer MUL and DIV are not quite rare enough for trap
    and emulate to be acceptable from a performance POV. The slow hardware
    integer divide did manage to outperform a software shift-and-subtract
    loop though (so it had that much going for it at least).


    For Binary64, this unit is around 112 cycles for FDIV (due to quirks).

    In the past, hardware Newton-Raphson was an option, but it is more
    complicated and expensive to make work well.

    The FMUL is a fair bit faster, and this means software Newton-Raphson
    is still the most attractive option from the performance POV.




    If done for Binary128, it would be around 228 cycles for FMUL and FDIV,
    assuming the shift-and-add unit remains at 1 bit per cycle.
    There is concern that internal latency could require 0.5 bits/cycle,
    which would be 456 cycles.

    If it were 456 cycles, may as well just use trap-and-emulate at that
    point...

    In the latter case, just using the 32-bit widening integer multiplier
    to implement the Binary128 FMUL and using Newton-Raphson is likely to
    be faster.

    The main merit of Binary128, though, is that "long double" is so
    infrequently used that it almost doesn't matter if it is glacially
    slow (even more so with FDIV, which for many programs might not happen
    at all).

    ...



    I seem to find that it is difficult to get better performance for FDIV
    than using a simple divider.

    FMA has a latency of about 40 clocks at 300 MHz (or 20 CPU clocks). So performing three or four iterations of NR in software (60 to 80 clocks)
    is just about as time consuming as using a divider.

    For FDIV (or FMUL) with a radix-2 divide it can probably operate at
    double the CPU clock frequency. For instance the FDIV in my float
    package runs at almost 300 MHz. But the CPU can only be clocked about
    100 MHz. So a double-frequency clock is used for FDIV. This cuts the relative latency in half (60 CPU clocks).

    I could maybe better balance the timing in the FMA to reduce the latency somewhat and still keep the same FMAX. The 64x64 multiply, built up out of 16x16 multipliers, has by itself about 11 cycles of latency.


    OK, I have:
    Binary64 FMUL: 6 cycles
    Binary64 FADD: 6 cycles (incl FSUB, Int<->FP)
    Via SIMD Unit:
    Binary32 FMUL: 3 cycles (incl SIMD)
    Binary32 FADD: 3 cycles (incl SIMD)
    FMULA/FADDA: Also 3 cycles (Binary64 format at Binary32 precision).

    This mostly leaves N-R as the fastest strategy in this case.

    No FMA as there isn't really a good way to get the latency low enough
    except in a very niche case of FP8*FP8+FP16, but this would likely only
    really be useful for NN's or similar (not as useful as a general purpose
    SIMD instruction).

    Granted, FP8 for inputs/weights and FP16 accumulators does seem to be a
    fairly effective approach for NN's.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Tim Rentsch@[email protected] to comp.arch on Sat Feb 14 20:49:05 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:

    On Wed, 21 Jan 2026 01:44:08 GMT
    MitchAlsup <[email protected]d> wrote:

    Anyone still here and active ???

    https://www.linfo.org/rule_of_silence.html

    I have serious doubts about the universal wisdom of this rule in the
    field of human-machine interfaces, but for Usenet interaction it is
    golden.

    Thank you for this.
    --- Synchronet 3.21b-Linux NewsLink 1.2