• Re: Tonight's Tradeoff

    From Terje Mathisen@[email protected] to comp.arch on Sun Feb 1 17:51:07 2026
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Paul Clayton <[email protected]> writes:
    On 11/13/25 5:13 PM, MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:
    [snip]
    What I wanted to write was "And assembly language is
    architecture-specific".

    I have worked on a single machine with several different ASM "compilers".
    Believe me, one asm can be different than another asm.

    But it is absolutely true that asm is architecture specific.

    Is that really *absolutely* true? Architecture usually includes binary
    encoding (and memory order model and perhaps other non-assembly details).

    I do not know if being able to have an interrupt in the middle of an
    assembly instruction is a violation of the assembly contract. (In
    theory, a few special cases might be handled such that the assembly
    instruction that breaks into more than one machine instruction is
    handled similarly to breaking instructions into µops.) There might not
    be any practical case where all the sub-instructions of an assembly
    instruction are also assembly instructions (especially not if
    retaining instruction size compatibility, which would be difficult
    with such assembly instruction fission anyway).

    The classic case is the VAX MOVC3/MOVC5 instructions. An interrupt
    could occur during the move and simply restart the instruction
    (the register operands having been updated as each byte was moved).
    An even more common example (numbering in the 100M to 1B range?) is x86
    processors with interruptible REP MOVS/STOS/LODS instructions.
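
    The restartable-instruction idea can be sketched in a few lines of C
    (an illustration of mine, not from the thread): all progress lives in
    architectural registers, so an interrupt can abandon the instruction
    mid-way and the handler's return simply re-executes it from the saved
    register state. The struct stands in for x86's RSI/RDI/RCX as used by
    REP MOVSB.

```c
#include <stddef.h>

/* Simulated architectural register state for a REP MOVSB-like copy. */
struct movs_regs {
    const unsigned char *rsi;  /* source pointer  */
    unsigned char       *rdi;  /* dest pointer    */
    size_t               rcx;  /* bytes remaining */
};

/* Execute up to max_steps byte moves; return 1 when the "instruction"
 * completes, 0 when it is "interrupted" with all state saved in *r,
 * so that simply calling it again resumes the copy correctly. */
int rep_movsb(struct movs_regs *r, size_t max_steps)
{
    while (r->rcx > 0) {
        if (max_steps-- == 0)
            return 0;              /* interrupt taken mid-instruction */
        *r->rdi++ = *r->rsi++;     /* move one byte, advance pointers */
        r->rcx--;
    }
    return 1;                      /* instruction retired */
}
```

    Because the pointers and count advance per byte, re-execution after an
    interrupt is indistinguishable from never having been interrupted.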
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sun Feb 1 18:01:13 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:
    reasonable to have a fuzzier sense of assembly language to include at
    least encoding changes. It seems reasonable to me for "assembly
    language" to mean the preferred language for simple mapping to machine
    instructions (which can include idioms — different spellings of the
    same machine instruction — and macros).

    The modern sense of ASM is that it is an ASCII version of binary.
    The old sense where ASM was a language that could do anything and
    everything (via Macros) has slipped into the past.

    In my current world, asm is what I use for inline kernels that cannot
    be directly described in Rust (or C/C++), letting the compiler handle
    all the scaffolding that would have been handled by asm MACROs 40
    years ago.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Wed Feb 4 22:31:23 2026
    From Newsgroup: comp.arch

    On 1/28/26 10:34 AM, John Dallman wrote:
    In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul Clayton) wrote:

    I _feel_ that if only the opcode encoding is changed (a very tiny
    difference that would only affect using code as data) that one
    could rightly state that the new architecture uses the same
    assembly.

    That would, however, raise questions and doubts among everyone who was
    aware of the different instruction encodings. You would do far better to
    say that the new architecture is compatible at the assembler source
    level, but not at the binary level.

    I tend to agree. I was arguing semantics (what is assembly?) not
    best practice.

    Currently, assembly-level compatibility does not seem worthwhile.
    Software is usually distributed as machine code binaries not as
    assembly, and software is usually developed in at least a C-level
    language rather than assembly. In the past, easy translation of
    assembly to support a new machine language would be useful, but
    this seems not to be the case now.



    I doubt there could be any economic justification for
    only changing the opcode encoding, but theoretically such could
    have multiple architectures with the same assembly.

    There was a threatened case of this in the early years of this century.
    Intel admitted to themselves that AMD64 was trouncing Itanium in the
    marketplace, and they needed to do 64-bit x86 or see their company
    shrink dramatically. However, they did not want to do an AMD-compatible
    x86-64. They wanted to use a different instruction encoding and have
    deliberate binary incompatibility.

    Would the Intel-64 have been assembly compatible with AMD64? I
    would have guessed that not just encodings would have been
    different. If one wants to maintain market friction, supporting
    the same assembly seems counterproductive.

    This was crazy from the network externalities point of view. It was an
    anti-competitive move, requiring software vendors to do separate builds
    for Intel and AMD, hoping that they would not bother with AMD builds.

    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think). However, doing what is best generally
    for customers is not necessarily the most profitable action.

    Microsoft killed this idea, by refusing to support any such
    Intel-specific 64-bit x86. They could not prevent Intel doing it, but
    there would not be Windows for it. Intel had to climb down.

    Which was actually a sane action not just from the hassle to
    Microsoft of supporting yet another ISA but the confusion of
    users (Intel64 and AMD64 both run x86-32 binaries but neither
    Intel64 nor AMD64 run the other's binaries!) which would impact
    Microsoft (and PC OEMs) more than Intel.

    I do not think assembly language design considered the possible effects
    of the memory order model. (Have all x86 implementations been compatible?
    I think the specification changed, but I do not know if
    compatibility was broken.)

    In general, the assembly programmer is responsible for considering the
    memory model, not the language implementation.

    Yes, but for a single-threaded application this is not a factor —
    so such would be more compatible. It is not clear if assembly
    programmers would use less efficient abstractions (like locks) to
    handle concurrency in which case a different memory model might
    not impact correctness. On the one hand, assembly is generally
    chosen because C provides insufficient performance (or
    expressiveness), which would imply that assembly programmers
    would not want to leave any performance on the table and would
    exploit the memory model. On the other hand, the assembly
    programmer mindset may often be more serial and the performance
    cost of using higher abstractions for concurrency may be lower
    than the debugging costs of being clever relative to using
    cleverness for other optimizations.
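
    The "less efficient abstraction" point can be made concrete with a
    sketch (names and structure are mine, not from the thread): a
    lock-protected counter. The mutex's acquire/release ordering makes
    this correct on x86-TSO and on weaker memory models alike, so the
    code pays peak performance for immunity to memory-model differences,
    unlike hand-rolled lock-free code that exploits a specific model.

```c
#include <pthread.h>

static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&lock);    /* acquire: orders later accesses */
        counter++;                    /* plain access, safely serialized */
        pthread_mutex_unlock(&lock);  /* release: publishes the write */
    }
    return (void *)0;
}

/* Run nthreads workers of iters increments each; return the total. */
long run_workers(int nthreads, long iters)
{
    pthread_t t[16];
    if (nthreads > 16)
        nthreads = 16;
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], (void *)0, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], (void *)0);
    return counter;
}
```

    The same program with unsynchronized increments would be a data race
    whose visible behavior could legitimately differ across memory models.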

    In addition to the definition for "assembly language" one also
    needs to define "architecture".

    Actually, the world seems to get on OK without such clear definitions.
    The obscurity of assembly language tends to limit its use to those who
    really need to use it, and who are prepared to use a powerful but
    unforgiving tool.

    Yes, the niche effect helps to avoid diversity of meaning across
    users and across time. I suspect jargon also changes less rapidly
    than common language both because there is less interaction and
    there is more pressure to be formal in expression.

    Intel has sold incompatible architectures within the same design
    by fusing off functionality and has even had different application
    cores in the same chip have different instruction support (though
    that seems to have bitten Intel).

    Well, different ISA support in different cores in the same processor
    package is just dumb[1]. It reflects a delusion that Intel has suffered
    since at least the late 1990s: that software is specific to particular
    generations of their chips, and there's a new release with significant
    changes for each new generation. Plenty of Intel people know that is
    true for motherboard firmware, but not for operating systems or
    application software. But the company carries on behaving that way.

    I do not think ISA heterogeneity is necessarily problematic. I
    suspect it might require more system-level organization (similar
    to Apple). Even without ISA heterogeneity, optimal scheduling
    seems to be a hard problem. Energy/power and delay/performance
    preferences are not typically expressed. The abstraction of each
    program owning the machine seems to discourage nice behavior (pun
    intended).

    Intel seems to be conflicted between encouraging software use of
    features and extracting profit from those users who benefit more
    from certain features. Maximizing availability of an
    architectural feature encourages software to adopt the feature,
    but limiting availability allows charging more for enabling a
    feature.


    [1] See the Cell processor for an extreme example.

    I thought Cell was almost an embedded system. The SIMD-focused
    processors were more like GPUs, I thought, and intended to be
    used as such. For games, this might have made sense. However,
    I think this was before General Purpose GPU was a thing.
    (I thought Intel marketed their initial 512-bit SIMD processors
    as GPGPUs with x86 compatibility, so the idea of having a
    general purpose ISA morphed into a GPU-like ISA had some
    fascination after Cell.)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Feb 5 19:02:14 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 1/28/26 10:34 AM, John Dallman wrote:
    In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul Clayton) wrote:
    -----------------

    Would the Intel-64 have been assembly compatible with AMD64? I

    Andy Glew indicated similar but not exact enough.
    Andy also stated that Microsoft forced Intel's hand towards x86-64.

    would have guessed that not just encodings would have been
    different. If one wants to maintain market friction, supporting
    the same assembly seems counterproductive.

    It was, in essence, the control register model, the nested paging,
    and other sundry non-ISA components.

    This was crazy from the network externalities point of view. It was an
    anti-competitive move, requiring software vendors to do separate builds
    for Intel and AMD, hoping that they would not bother with AMD builds.

    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think). However, doing what is best generally
    for customers is not necessarily the most profitable action.

    Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
    and making optimal battle plans for the Little Big Horn battle to come.

    Microsoft killed this idea, by refusing to support any such
    Intel-specific 64-bit x86. They could not prevent Intel doing it, but
    there would not be Windows for it. Intel had to climb down.

    Which was actually a sane action not just from the hassle to
    Microsoft of supporting yet another ISA but the confusion of
    users (Intel64 and AMD64 both run x86-32 binaries but neither
    Intel64 nor AMD64 run the other's binaries!) which would impact
    Microsoft (and PC OEMs) more than Intel.

    I do not think assembly language design considered the possible effects
    of the memory order model. (Have all x86 implementations been compatible?
    I think the specification changed, but I do not know if
    compatibility was broken.)

    In general, the assembly programmer is responsible for considering
    the memory model, not the language implementation.

    Yes, but for a single-threaded application this is not a factor —
    so such would be more compatible. It is not clear if assembly
    programmers would use less efficient abstractions (like locks) to
    handle concurrency in which case a different memory model might
    not impact correctness. On the one hand, assembly is generally
    chosen because C provides insufficient performance (or
    expressiveness), which would imply that assembly programmers
    would not want to leave any performance on the table and would
    exploit the memory model. On the other hand, the assembly
    programmer mindset may often be more serial and the performance
    cost of using higher abstractions for concurrency may be lower
    than the debugging costs of being clever relative to using
    cleverness for other optimizations.

    In addition to the definition for "assembly language" one also
    needs to define "architecture".

    Actually, the world seems to get on OK without such clear definitions.
    The obscurity of assembly language tends to limit its use to those
    who really need to use it, and who are prepared to use a powerful
    but unforgiving tool.

    Yes, the niche effect helps to avoid diversity of meaning across
    users and across time. I suspect jargon also changes less rapidly
    than common language both because there is less interaction and
    there is more pressure to be formal in expression.

    Intel has sold incompatible architectures within the same design
    by fusing off functionality and has even had different application
    cores in the same chip have different instruction support (though
    that seems to have bitten Intel).

    Well, different ISA support in different cores in the same processor
    package is just dumb[1]. It reflects a delusion that Intel has suffered
    since at least the late 1990s: that software is specific to particular
    generations of their chips, and there's a new release with significant
    changes for each new generation. Plenty of Intel people know that is
    true for motherboard firmware, but not for operating systems or
    application software. But the company carries on behaving that way.

    One can still buy a milling machine built in 1937 and run it in his shop.
    Can one even do this for software from the previous decade ??

    MS wants you to buy Office every time you buy a new PC.
    MS then moves all the menu items to different pull-downs and
    makes it difficult to adjust to the new SW--and then it has the
    gall to chew up valuable screen space with ever larger
    pull-down bars.

    Is it any wonder users want the 1937 milling machine model ???

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Thu Feb 5 14:35:57 2026
    From Newsgroup: comp.arch

    On 2/5/2026 11:02 AM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 1/28/26 10:34 AM, John Dallman wrote:
    In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul
    Clayton) wrote:
    -----------------

    Would the Intel-64 have been assembly compatible with AMD64? I

    Andy Glew indicated similar but not exact enough.
    Andy also stated that Microsoft forced Intel's hand towards x86-64.
    [...]

    Side note (sorry for injecting ;^o ): I had the pleasure to converse
    with Andy Glew on this very group. Very nice indeed. All about DWCAS and
    fun things. This is a nice group.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Fri Feb 6 15:54:00 2026
    From Newsgroup: comp.arch

    In article <10m12ue$2t2k5$[email protected]>, [email protected] (Paul Clayton) wrote:

    Currently, assembly-level compatibility does not seem worthwhile.

    Not now, no. There was one case where it was valuable: the assembler
    source translator for 8080 to 8086. That plus the resemblance of early
    MS-DOS to CP/M meant that CP/M software written in assembler could be
    got working on the early IBM PC and compatibles more rapidly than new
    software could be developed in high-level languages. That was one of
    the factors in the runaway success of PC-compatible machines in the
    early 1980s.

    Software is usually distributed as machine code binaries not as
    assembly,

    Or as source code...

    Would the Intel-64 have been assembly compatible with AMD64? I
    would have guessed that not just encodings would have been
    different. If one wants to maintain market friction, supporting
    the same assembly seems counterproductive.

    It would hardly have mattered. Very little assembler is written for
    64-bit architectures.

    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think).

    Intel didn't admit to themselves they needed to do 64-bit x86 until AMD64
    was thrashing them in the market. Far too late for collaborative design
    by then.

    It is not clear if assembly programmers would use less efficient
    abstractions (like locks) to handle concurrency in which case
    a different memory model might not impact correctness.

    You are thinking of doing application programming in assembler. That's
    pretty much extinct these days. Use of assembler to implement locks or
    other concurrency-control mechanisms in an OS or a language run-time
    library is far more likely.

    I've been doing low-level parts of application development for over 40
    years. In 1983-86, I was working in assembler, or needed to have a very
    close awareness of the assembler code being generated by a higher level
    language. In 1987-1990, I needed to be able to call assembler-level OS
    functions from C code. Since then, the only coding I've done in
    assembler has been to generate hardware error conditions for testing
    error handlers.
    I've read and debugged lots of compiler-generated assembler to report
    compiler bugs, but that has become far less common over time.

    I do not think ISA heterogeneity is necessarily problematic.

    It requires the OS scheduler to be ISA-aware, and to never, /ever/ put a
    thread onto a core that can't run the relevant ISA. That will inevitably
    make the scheduler more complicated and thus increase system overheads.

    I suspect it might require more system-level organization (similar
    to Apple).

    Have you ever tried to optimise multi-threaded performance on a modern
    Apple system with a mixture of Performance and Efficiency cores? I have,
    and it's a lot harder than Apple give the impression it will be.

    Apple make an assumption: that you will use their "Grand Central
    Dispatch" threading model. That requires multi-threaded code to be
    structured as a one-direction pipeline of work packets, with buffers
    between them, and
    one thread/core per pipeline stage. That's a sensible model for some
    kinds of work, but not all kinds. It also requires compiler extensions
    which don't exist on other compilers. So you have to fall back to POSIX
    threads to get flexibility and portability.

    If you're using POSIX threads, the scheduler seems to assign threads to
    cores randomly. So your worker threads spend a lot of time on Efficiency
    cores. Those are in different clusters from the Performance cores, which
    means that communications between threads (via locks) are very slow.
    Using Apple's performance category attributes for threads has no obvious
    effect on this.

    The way to fix this is to find out how many Performance cores there are
    in a Performance cluster (which wasn't possible until macOS 12) and use
    that many threads. Then you need to reach below the POSIX threading layer
    to the underlying BSD thread layer. There, you can set an association
    number on your threads, which tells the scheduler to try to run them in
    the same cluster. Then you get stable and near-optimal performance. But
    finding out how to do this is fairly hard, and few seem to have managed it.

    Even without ISA heterogeneity, optimal scheduling
    seems to be a hard problem. Energy/power and delay/performance
    preferences are not typically expressed. The abstraction of each
    program owning the machine seems to discourage nice behavior (pun
    intended).

    Allowing processes to find out the details of other processes' resource
    usage makes life very complicated, and introduces new opportunities
    for security bugs.

    (I thought Intel marketed their initial 512-bit SIMD processors
    as GPGPUs with x86 compatibility, so the idea of having a
    general purpose ISA morphed into a GPU-like ISA had some
    fascination after Cell.)

    Larrabee turned out to be a pretty bad GPU, and a pretty bad set of CPUs.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sat Feb 7 21:49:08 2026
    From Newsgroup: comp.arch

    On 11/5/25 3:52 PM, MitchAlsup wrote:

    Robert Finch <[email protected]> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right away,
    that reduced most instructions by eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    I must be misunderstanding your math; if half of the
    6-byte instructions are two operations, I think that
    means 12 bytes would have three operations which is
    the same as for a 32-bit ISA.

    Perhaps you meant for every two instructions, there
    is a 50% chance neither can be "fused" and a 50%
    chance they can be fused with each other; this would
    get four operations in 18 bytes, which _is_ 12.5%
    bigger. That seems an odd expression, as if the
    ability to fuse was not quasi-independent.

    It could just be that one of us has a "thought-O".
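
    The second reading can be checked with a few lines (function names
    are mine): take a fraction f of all operations as packed two per
    48-bit (6-byte) word and the rest one per word, against a 32-bit
    ISA's 4-bytes-per-op baseline.

```c
/* Bytes per operation for a 48-bit ISA where fused_fraction of all
 * operations arrive packed two to a 6-byte instruction word. */
double bytes_per_op_48(double fused_fraction)
{
    double words_per_op = fused_fraction / 2.0 + (1.0 - fused_fraction);
    return 6.0 * words_per_op;
}

/* Size relative to a 32-bit, 4-bytes-per-op ISA (1.125 = 12.5% bigger). */
double relative_size(double fused_fraction)
{
    return bytes_per_op_48(fused_fraction) / 4.0;
}
```

    With f = 0.5 this gives 4.5 bytes per operation, i.e. 12.5% bigger
    than the 32-bit baseline, matching the second interpretation above.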

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.
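
    The build-it-in-a-register fallback can be sketched as follows
    (encoding details and names are mine, not Qupls2026's; the post says
    about three instructions, and the exact count depends on the
    immediate widths available -- this sketch gets away with two because
    it assumes a full 48-bit immediate plus one shifted add).

```c
#include <stdint.h>

/* "Instruction" 1: load the low 48 bits as an immediate. */
uint64_t load_imm48(uint64_t imm48)
{
    return imm48 & 0xFFFFFFFFFFFFull;
}

/* "Instruction" 2: add-immediate-with-shift. */
uint64_t add_imm_shift(uint64_t r, uint64_t imm, unsigned shift)
{
    return r + (imm << shift);
}

/* Assemble an arbitrary 64-bit constant in a register: the low 48
 * bits and the shifted high 16 bits occupy disjoint bit ranges, so
 * the add cannot carry and the result is exact. */
uint64_t build_const64(uint64_t c)
{
    uint64_t r = load_imm48(c);
    return add_imm_shift(r, c >> 48, 48);
}
```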

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I am guessing that having had experience with x86
    (and the benefit of predecode bits), you recognized
    that VLE need not be horribly complex to parse.
    My 66000 does not use "start bits", but the length
    is quickly decoded from the first word and the
    critical information is in mostly fixed locations
    in the first word. (One might argue that opcode
    can be in two locations depending on if the
    instruction uses a 16-bit immediate or not —
    assuming I remember that correctly.)

    Obviously, something like DOUBLE could provide
    extra register operands to a complex instruction,
    though there may not be any operation needing
    five register inputs. Similarly, opcode refinement
    (that does not affect operation routing) could be
    placed into an "immediate". I think you do not
    expect to need such tricks because reduced
    number of instructions is a design principle and
    there is lots of opcode space remaining, but I
    feel these also allow the ISA to be extended in
    unexpected directions.

    I think that motto could be generalized to "do
    not do at decode time what can be done at
    compile time" (building immediates could be
    "executed" in decode). There are obvious limits
    to that principle; e.g., one would not encode
    instructions as control bits, i.e., "predecoded",
    in order to avoid decode work. For My 66000
    immediates, reducing decode work also decreases
    code size.

    Discerning when to apply a transformation and if/
    where to cache the result seems useful. E.g., a
    compiler caches the source code to machine code
    transformation inside an executable binary. My
    66000's Virtual Vector Method implementations
    are expected, from what I understand, to cache
    fetch and decode work and simplify operand
    routing.

    Caching branch prediction information in an
    instruction seems to be viewed generally as not
    worth much since dynamic predictors are generally
    more accurate. Static prediction by branch
    "type" (e.g., forward not-taken) can require no
    additional information. (Branch prediction
    _directives_ are somewhat different. Such might
    be used to reduce the time for a critical path,
    but average time is usually a greater concern.)
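
    The "dynamic predictors are generally more accurate" point is easy to
    illustrate with a minimal sketch (mine, not from the thread): a 2-bit
    saturating counter adapts to observed behavior, while a static
    forward-not-taken hint cached in the instruction never changes.

```c
/* Classic 2-bit saturating counter: states 0..3, >= 2 predicts taken. */
typedef struct { int state; } bpred2_t;

int bpred2_predict(const bpred2_t *p) { return p->state >= 2; }

void bpred2_update(bpred2_t *p, int taken)
{
    if (taken  && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
}

/* Feed a direction stream (1 = taken); return correct predictions. */
int bpred2_run(bpred2_t *p, const int *dirs, int n)
{
    int correct = 0;
    for (int i = 0; i < n; i++) {
        if (bpred2_predict(p) == dirs[i])
            correct++;
        bpred2_update(p, dirs[i]);
    }
    return correct;
}
```

    On an always-taken branch, a static not-taken hint is wrong forever,
    while the counter mispredicts only during its brief warm-up.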
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun Feb 8 10:24:54 2026
    From Newsgroup: comp.arch

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which
    is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned — e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    With predication of all short forward branches in order to
    avoid fetch bubbles, the impact of delayed predicate
    availability and missing information for branch prediction
    may be greater than for more selective predication.

    There may also be some short forward branches that are 99%
    taken such that converting to a longer branch with a jump back
    may be a better option. With trace-cache-like optimization,
    such branched over code could be removed from fetch even when
    the compiler used a short branch. Dynamic code organization
    has the advantage of being able to use dynamically available
    information (and the disadvantage of gathering the information
    and making a decision dynamically).

    Something like a branch target cache could store extracted
    instructions. This might facilitate stitching such
    instructions back into the instruction stream with limited
    overhead. Since this would only work for usually taken
    hammock branches, it would probably not be worthwhile. For
    if-then-else constructs, one might place both paths in
    separate entries in such a target cache and always stitch
    in one of them, but that seems wonky.

    I rather doubt the benefits of such would justify the added
    complexity — almost certainly not in a first or second
    implementation of an architecture — but I would not want to
    reject future possible adoption of such techniques.

    My guess would be that most short forward branches would not
    use an extracted code cache either because they are generally
    not taken so there is no fetch advantage or because the branch
    direction is unpredictable such that predication likely makes
    more sense and fetching from two structures just adds
    complexity.

    For highly unlikely code, an extracted cache might have
    higher latency (from placement, from deferred access to use
    better prediction, or from more complex retrieval). Stalling
    renaming when a longer-latency insertion is predicted seems
    undesirable (though it may have negligible performance harm),
    but including just enough dataflow information in the quicker
    accessed caches to support out-of-order fetch seems
    complicated.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sun Feb 8 18:22:46 2026
    From Newsgroup: comp.arch

    On 2/5/26 2:02 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    [snip]
    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think). However, doing what is best generally
    for customers is not necessarily the most profitable action.

    Yes, imaging Custer (Intel) and AMD (Sioux) sitting down together
    and making optimal battle plans for Little Big Horn battle to come.

    Rather than making battle plans for how to annihilate each
    other, perhaps finding a better solution than ratting each
    other out in the prisoner's dilemma.

    [snip]
    One can still buy a milling machine built in 1937 and run it in his shop.
    Can one even do this for software from the previous decade ??

    Yes, but dependency on (proprietary) servers for some games has
    made them (unnecessarily) unplayable.

    From what I understand, one can still run WordPerfect under a
    DOS emulator on modern x86-64.

    With the poor security of much software, even OSes, one might
    want to contain any legacy software in a more secured
    environment.

    Preventing automatic update is perhaps more of a hassle. Some
    people have placed software in a virtual machine that has no
    networking to avoid software breaking.

    MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur
    automatically.

    MS, then moves all the menu items to different pull downs and
    makes it difficult to adjust to the new SW--and then it has the
    Gaul to chew up valuable screen space with ever larger pull-
    down bars.

    Ah, but they are just beginning to include advertising. Imagine
    every time one uses the mouse (to indicate to the computer that
    the user's eyes are focused on a particular place) an
    advertisement appears and follows the cursor movement. Even just
    having menu entries that are advertisements would be kind of
    annoying, but one would be able to get rid of those by leasing
    the premium edition (until one needs to lease the platinum
    edition, then the "who wants to remain a millionaire" edition).

    Is it any wonder users want the 1937 milling machine model ???

    Have no fear; soon you may be merely leasing your computer.
    Computers need to have the latest spyware so that advertisements
    can be appropriately targeted and adblocking must be made
    impossible.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Feb 9 19:09:50 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 11/5/25 3:52 PM, MitchAlsup wrote:

    Robert Finch <[email protected]> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6 bits. Right away, that
reduced most instructions by eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only 12% bigger than a 32-bit ISA.

    I must be misunderstanding your math; if half of the
    6-byte instructions are two operations, I think that
    means 12 bytes would have three operations which is
    the same as for a 32-bit ISA.

    Perhaps you meant for every two instructions, there
    is a 50% chance neither can be "fused" and a 50%
    chance they can be fused with each other; this would
    get four operations in 18 bytes, which _is_ 12.5%
    bigger. That seems an odd expression, as if the
    ability to fuse was not quasi-independent.

    It could just be that one of us has a "thought-O".

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I am guessing that having had experience with x86
    (and the benefit of predecode bits), you recognized
    that VLE need not be horribly complex to parse.
    My 66000 does not use "start bits", but the length
    is quickly decoded from the first word and the
    critical information is in mostly fixed locations
    in the first word.

My 66000 constants are available when:
inst<31> = 0 and
inst<30> != inst<29> and
inst<6> = 1, where
inst<6:5> = 10 means a 32-bit constant and
inst<6:5> = 11 means a 64-bit constant.
Six gates total and 2 gates of delay give the unary 3-bit
instruction length.
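Read literally, that decode is simple enough to sketch in a few lines of C (a toy model of my reading of the bit positions above; the exact field placement is my assumption, not the published My 66000 encoding):

```c
#include <stdint.h>

/* Sketch of the constant-length decode described above.
 * Returns total instruction length in 32-bit words: 1, 2, or 3. */
static int inst_len_words(uint32_t inst)
{
    int b31 = (inst >> 31) & 1;
    int b30 = (inst >> 30) & 1;
    int b29 = (inst >> 29) & 1;
    int has_const = (b31 == 0) && (b30 != b29) && ((inst >> 6) & 1);

    if (!has_const)
        return 1;                  /* no trailing constant word(s) */
    return ((inst >> 5) & 1) ? 3   /* inst<6:5> == 11: 64-bit constant */
                             : 2;  /* inst<6:5> == 10: 32-bit constant */
}
```

The point of the exercise is that a handful of gates looking at a fixed slice of the first word is all a fetcher needs to steer itself, with no serial length-chaining as in x86.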

    (One might argue that opcode
    can be in two locations depending on if the
    instruction uses a 16-bit immediate or not —
    assuming I remember that correctly.)

    Obviously, something like DOUBLE could provide
    extra register operands to a complex instruction,
    though there may not be any operation needing
    five register inputs. Similarly, opcode refinement
    (that does not affect operation routing) could be
    placed into an "immediate". I think you do not
    expect to need such tricks because reduced
    number of instructions is a design principle and
    there is lots of opcode space remaining, but I
    feel these also allow the ISA to be extended in
    unexpected directions.

I think that motto could be generalized to "do
not do at decode time what can be done at
compile or link time" (building immediates
could be "executed" in decode). There are
obvious limits
    to that principle; e.g., one would not encode
    instructions as control bits, i.e., "predecoded",
    in order to avoid decode work. For My 66000
    immediates, reducing decode work also decreases
    code size.

    Discerning when to apply a transformation and if/
    where to cache the result seems useful. E.g., a
    compiler caches the source code to machine code
    transformation inside an executable binary. My
    66000's Virtual Vector Method implementations
    are expected, from what I understand, to cache
    fetch and decode work and simplify operand
    routing.

    First v in vVM is lower case.

    Caching branch prediction information in an
    instruction seems to be viewed generally as not
    worth much since dynamic predictors are generally
    more accurate.

    Yes. If your branch predictor is having problems
    then use predication for flow control.

    Static prediction by branch
    "type" (e.g., forward not-taken) can require no
    additional information. (Branch prediction
    _directives_ are somewhat different. Such might
    be used to reduce the time for a critical path,
    but average time is usually a greater concern.)
  • From MitchAlsup@[email protected] to comp.arch on Mon Feb 9 19:20:01 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 11/5/25 3:43 PM, MitchAlsup wrote:
    [snip]
    I am now working on predictors for a 6-wide My 66000 machine--which is a bit
    different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.

    Not recording the history of predicates may have a negative
    effect on global history predictors. (I do not know if anyone
    has studied this, but it has been mentioned — e.g.,
    "[predication] has a negative side-effect because the removal
    of branches eliminates useful correlation information
    necessary for conventional branch predictors" from "Improving
    Branch Prediction and Predicated Execution in Out-of-Order
    Processors", Eduardo Quiñones et al., 2007.)

It depends on where you are looking! If you think branch prediction
alters where FETCH is fetching, then My 66000 predication does not
    do predication prediction--predication is used when the join point
    will have already been fetched by the time the condition is known.
    Then, either the then clause or the else clause will be nullified
    without backup (i.e., branch prediction repair).

    DECODE is still able to predict then-clause versus else-clause
    and maintain the no-backup property, as long as both sides are
    issued into the execution window.

    Predicate prediction can also be useful when the availability
    of the predicate is delayed. Similarly, selective eager
    execution might be worthwhile when the predicate is delayed;
    the selection is likely to be predictive (resource use might
    be a basis for selection but even estimating that might be
    predictive).

    The difference is that predication prediction never needs branch
    prediction repair.

  • From MitchAlsup@[email protected] to comp.arch on Mon Feb 9 19:33:36 2026
    From Newsgroup: comp.arch


    Paul Clayton <[email protected]> posted:

    On 2/5/26 2:02 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    [snip]
    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think). However, doing what is best generally
    for customers is not necessarily the most profitable action.

Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
and making optimal battle plans for the Little Big Horn battle to come.

    Rather than making battle plans for how to annihilate each
other, perhaps finding a better solution than ratting each
other out in the prisoner's dilemma.

    [snip]
    One can still buy a milling machine built in 1937 and run it in his shop. Can one even do this for software from the previous decade ??

    Yes, but dependency on (proprietary) servers for some games has
    made them (unnecessarily) unplayable.

    From what I understand, one can still run WordPerfect under a
    DOS emulator on modern x86-64.

    With the poor security of much software, even OSes, one might
    want to contain any legacy software in a more secured
    environment.

    Preventing automatic update is perhaps more of a hassle. Some
    people have placed software in a virtual machine that has no
    networking to avoid software breaking.

    MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur automatically.

    When I need a tool--I buy that tool--I never rent that tool.

    Name one feature I would want from office365 that was not already
    present in office from <say> 1998.

MS then moves all the menu items to different pull-downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger pull-
down bars.

    Ah, but they are just beginning to include advertising. Imagine
    every time one uses the mouse (to indicate to the computer that
    the user's eyes are focused on a particular place) an
    advertisement appears and follows the cursor movement. Even just
    having menu entries that are advertisements would be kind of
    annoying, but one would be able to get rid of those by leasing
    the premium edition (until one needs to lease the platinum
    edition, then the "who wants to remain a millionaire" edition).

    Why would I or anyone want advertising in office ????????

    Is it any wonder users want the 1937 milling machine model ???

    Have no fear; soon you may be merely leasing your computer.
    Computers need to have the latest spyware so that advertisements
    can be appropriately targeted and adblocking must be made
    impossible.

    I am the kind of guy that turns off "telemetry" and places advertisers
    in /hosts file.
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 9 21:18:20 2026
    From Newsgroup: comp.arch

    On 2/9/26 2:33 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 2/5/26 2:02 PM, MitchAlsup wrote:

[snip]
MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur
    automatically.

    When I need a tool--I buy that tool--I never rent that tool.

    Name one feature I would want from office365 that was not already
    present in office from <say> 1998.

    I do not know if MS can legally cancel your MS Office license,
    and I doubt the few "software pirates" who continue to use an
    unsupported ("invalid") version would be worth MS' time and
    effort to prevent such people from using such software.

    However, there seems to be a strong trend toward "you shall own
    nothing."

MS then moves all the menu items to different pull-downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger pull-
down bars.

    Ah, but they are just beginning to include advertising. Imagine
    every time one uses the mouse (to indicate to the computer that
    the user's eyes are focused on a particular place) an
    advertisement appears and follows the cursor movement. Even just
    having menu entries that are advertisements would be kind of
    annoying, but one would be able to get rid of those by leasing
    the premium edition (until one needs to lease the platinum
    edition, then the "who wants to remain a millionaire" edition).

    Why would I or anyone want advertising in office ????????

Why would anyone want advertising in a Windows Start Menu?

    For Microsoft such provides a bit more revenue/profit as
    businesses seem willing to pay for such advertisements. Have you
    ever heard "You are not the consumer; you are the product"?

    I think I read that some streaming services have added
    advertising to their (formerly) no-advertising subscriptions, so
    the suggested lease term inflation is not completely
    unthinkable.

    Is it any wonder users want the 1937 milling machine model ???

    Have no fear; soon you may be merely leasing your computer.
    Computers need to have the latest spyware so that advertisements
    can be appropriately targeted and adblocking must be made
    impossible.

    I am the kind of guy that turns off "telemetry" and places advertisers
    in /hosts file.

    If all new computers are "leased" (where tampering with the
    device — or not connecting it to the Internet such that it can
    phone home — revokes "ownership" and not merely warranty and one
    agrees to a minimum use [to ensure that enough ads are viewed]),
    ordinary users (who cannot assemble devices from commodity
    parts) would not have a choice. If governments enforce the
    rights of corporations to protect their businesses by outlawing
    sale of computer components to anyone who would work around the
    cartel, owning a computer could become illegal. Governments have
    an interest in having all domestic computers be both secure and
    to facilitate domestic surveillance, so mandating features that
    remove freedom and require an upgrade cycle (which is also good
    for the economy☺) has some attraction.

    I doubt people like you are a sufficient threat to profits that
    such extreme measures will be used, but the world (and
    particularly the U.S.) seems to be becoming somewhat dystopian.

    This is getting kind of off-topic and is certainly not something
    I want to think about.
  • From Thomas Koenig@[email protected] to comp.arch on Tue Feb 10 17:53:10 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    Why would I or anyone want advertising in office ????????

It is enough if Microsoft wants it... Oh, they'll call it
"information" or "tips". This was already displayed in the
start menu on my work computer some time ago because of some
IT failure (they failed to turn it off).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From George Neuner@[email protected] to comp.arch on Tue Feb 10 14:13:57 2026
    From Newsgroup: comp.arch

    On Mon, 09 Feb 2026 19:33:36 GMT, MitchAlsup
    <[email protected]d> wrote:


    Name one feature I would want from office365 that was not already
    present in office from <say> 1998.

    YMMV, but I'd say OpenDocument (ISO 26300) support.

    Like you, I stayed with Office97 for a long time. I jumped to 2013
    for awhile, briefly toyed with OpenOffice, and finally went to
    LibreOffice and never looked back.

    The biggest problem with Microsoft Office was/is that its various
    versions all had backward incompatibilities, so they could (and did)
    F_ up even working with their own .doc files.


    Why would I or anyone want advertising in office ????????

    LibreOffice and OpenOffice don't have advertising.

    Yes, you do need to get used to different menus / things you use
    frequently being in different places.

  • From David Brown@[email protected] to comp.arch on Wed Feb 11 15:05:34 2026
    From Newsgroup: comp.arch

    On 09/02/2026 20:33, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 2/5/26 2:02 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    [snip]
    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think). However, doing what is best generally
    for customers is not necessarily the most profitable action.

Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
and making optimal battle plans for the Little Big Horn battle to come.

    Rather than making battle plans for how to annihilate each
other, perhaps finding a better solution than ratting each
other out in the prisoner's dilemma.

    [snip]
One can still buy a milling machine built in 1937 and run it in his shop.
Can one even do this for software from the previous decade ??

    Yes, but dependency on (proprietary) servers for some games has
    made them (unnecessarily) unplayable.

    From what I understand, one can still run WordPerfect under a
    DOS emulator on modern x86-64.

    With the poor security of much software, even OSes, one might
    want to contain any legacy software in a more secured
    environment.

    Most old software did not have poor security. It was secure by not
    having features that could be abused - and thus no need to worry about
    extra layers to protect said features. MS practically invented the
    concept of insecure applications like word processors - they put
    unnecessary levels of automation and macros, integrated it with email (especially their already hopelessly insecure programs), and so on. No
    real user has any need for "send this document by email" in their word processor - but spam robots loved it. (MS even managed to figure out a
    way to let font files have executable malware in them.) If you go back
    to older tools that did the job they were supposed to do, without trying
    to do everything else, security is a non-issue for most software.

    The 1930's milling machine is safe because it is a milling machine. If
    MS made milling machines, they'd come with built-in beer fridges, TV
    screens and a subscription to sports channels - and in response to
    complaints of users chopping their fingers off, they'd add six layers of security gates that can't be passed without a Windows phone, controlled
    by a HAL 9000 that won't let you mill anything without first begging the
    IT department for permission. Of course, there would still be a small
    hatch at the back where you can put your remaining fingers in to get
    chopped off.


    Preventing automatic update is perhaps more of a hassle. Some
    people have placed software in a virtual machine that has no
    networking to avoid software breaking.

    MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur
    automatically.

    When I need a tool--I buy that tool--I never rent that tool.


    Nice in theory (and I fully agree with the aim), but it's getting
    steadily more difficult in practice.

    Name one feature I would want from office365 that was not already
    present in office from <say> 1998.


    Do you mean a /useful/ feature? That makes it a lot harder. What about
    that dancing paper clip? I haven't had any MS Office installed on a PC
    since Word for Windows 2.0 on Win3.11. (I have been a LibreOffice user
since its StarOffice ancestor - not that I use office suite software
    much.)

    MS, then moves all the menu items to different pull downs and
    makes it difficult to adjust to the new SW--and then it has the
    Gaul to chew up valuable screen space with ever larger pull-
    down bars.

    Ah, but they are just beginning to include advertising. Imagine
    every time one uses the mouse (to indicate to the computer that
    the user's eyes are focused on a particular place) an
    advertisement appears and follows the cursor movement. Even just
    having menu entries that are advertisements would be kind of
    annoying, but one would be able to get rid of those by leasing
    the premium edition (until one needs to lease the platinum
    edition, then the "who wants to remain a millionaire" edition).

    Why would I or anyone want advertising in office ????????

    Why would MS care what /users/ want?


    Is it any wonder users want the 1937 milling machine model ???

    Have no fear; soon you may be merely leasing your computer.
    Computers need to have the latest spyware so that advertisements
    can be appropriately targeted and adblocking must be made
    impossible.

    I am the kind of guy that turns off "telemetry" and places advertisers
    in /hosts file.

  • From George Neuner@[email protected] to comp.arch on Thu Feb 12 10:27:00 2026
    From Newsgroup: comp.arch



Realize that I'm responding to posts from different people below. I
hope the attribution is correct.



    On Wed, 11 Feb 2026 15:05:34 +0100, David Brown
    <[email protected]> wrote:

    On 09/02/2026 20:33, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    From what I understand, one can still run WordPerfect under a
    DOS emulator on modern x86-64.

    Yes you can.


    With the poor security of much software, even OSes, one might
    want to contain any legacy software in a more secured
    environment.

    Most old software did not have poor security. It was secure by not
    having features that could be abused - and thus no need to worry about
    extra layers to protect said features. MS practically invented the
    concept of insecure applications like word processors - they put
unnecessary levels of automation and macros, integrated it with email
(especially their already hopelessly insecure programs), and so on. No
real user has any need for "send this document by email" in their word
processor - but spam robots loved it. (MS even managed to figure out a
    way to let font files have executable malware in them.) If you go back
    to older tools that did the job they were supposed to do, without trying
    to do everything else, security is a non-issue for most software.

    Automation and macros? By that definition, you could argue that
    WordStar invented insecurity (on micros), and everyone else followed
    its bad example.

[You also could argue that the TECO editor on Unix was the origin (a
decade before WordStar), but the Unix environment made it more
difficult to cause any /major/ havoc with a dangerous editor macro.]

    Adding networking to CP/M, or DOS, or [early] Windows, just amplified
    the problem by making it easier to share and exchange files. The
    insecure OSes, combined with too powerful macro systems, made it
    relatively easy to destroy the whole system.


    MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur
    automatically.

    When I need a tool--I buy that tool--I never rent that tool.

    There won't be a choice. Sooner or later, Microsoft will stop selling
    software with perpetual licenses.

    Yet another reason to stop using their stuff.

  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 16 16:14:28 2026
    From Newsgroup: comp.arch

    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes
    for the branch predictors.

    You might be interested in looking at the 6th Championship
Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/

TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used as well as a useful bit.)
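The "folding" can be as simple as XORing fixed-width chunks of the history together (an illustrative sketch, not any particular TAGE implementation; real designs keep the folded value up to date incrementally in a small circular-shift register instead of recomputing it every cycle):

```c
#include <stdint.h>

/* Fold hist_len bits of global history (hist_len <= 32 here, just to
 * keep the toy simple) into an index_bits-wide table index by XORing
 * index_bits-sized chunks of the history together. */
static uint32_t fold_history(uint32_t hist, int hist_len, int index_bits)
{
    uint32_t mask = (1u << index_bits) - 1;
    uint32_t folded = 0;
    for (int i = 0; i < hist_len; i += index_bits)
        folded ^= (hist >> i) & mask;  /* XOR in the next chunk */
    return folded & mask;
}
```

Different TAGE tables would call this with different (geometrically increasing) hist_len values while sharing the same small index_bits, which is what lets very long histories fit in modest tables.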

    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not
    taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor. The Alpha 21264 had a modest-
    sized (by modern standards) local history predictor with a 1024-
    entry table of ten history bits indexed by the Program Counter
    and the ten bits were used to index a table of three-bit
    counters. This was combined with a 12-bit global history
    predictor with 2-bit counters (note: not gshare, i.e., xored
    with the instruction address) and the same index was used for
    the chooser.
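The two-level indirection on the local side can be sketched like this (table sizes taken from the description above; the 21264's actual update and timing details differ, and the global/chooser side is omitted, so treat this as a toy model):

```c
#include <stdint.h>

#define LHT_ENTRIES 1024   /* per-PC local histories, 10 bits each */
#define LPT_ENTRIES 1024   /* 3-bit saturating counters, one per pattern */

static uint16_t local_hist[LHT_ENTRIES];  /* 10-bit histories */
static uint8_t  local_ctr[LPT_ENTRIES];   /* counters, 0..7 */

/* First level: PC picks a history; second level: history picks a counter. */
static int predict_local(uint32_t pc)
{
    uint16_t h = local_hist[(pc >> 2) % LHT_ENTRIES] & 0x3FF;
    return local_ctr[h] >= 4;             /* MSB of the 3-bit counter */
}

static void update_local(uint32_t pc, int taken)
{
    uint16_t *hp = &local_hist[(pc >> 2) % LHT_ENTRIES];
    uint16_t h = *hp & 0x3FF;
    uint8_t *c = &local_ctr[h];
    if (taken  && *c < 7) (*c)++;
    if (!taken && *c > 0) (*c)--;
    *hp = (uint16_t)(((h << 1) | (taken & 1)) & 0x3FF);  /* shift in outcome */
}

static void train_local(uint32_t pc, int taken, int n)
{
    for (int i = 0; i < n; i++)
        update_local(pc, taken);
}
```

The serialized lookup (history table, then counter table) is exactly the latency problem the alloyed-prediction work below tries to hide.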

    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.

    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.

    (I am a little surprised that I have not read a suggestion to
    use the alternate 2-bit encoding that retains the last branch
    direction. This history might be useful for global history; the
    next most recent direction (i.e., not the predicted direction)
    of previous recent branches for a given global history might be
    useful in indexing a global history predictor. This 2-bit
    encoding seems to give slightly worse predictions than a
    saturating counter but the benefit of "localized" global history
    might compensate for this.)

    Alloyed prediction ("Alloyed Branch History: Combining Global
    and Local Branch History for Robust Performance", Zhijian Lu et
    al., 2002) used a tiny amount of local history to index a
    (mostly) global history predictor, hiding (much of) the latency
    of looking up the local history by retrieving multiple entries
    from the table and selecting the appropriate one with the local
    history.

    There was also a proposal ("Branch Transition Rate: A New Metric
for Improved Branch Classification Analysis", Michael Haungs et
    al., 2000) to consider transition rate, noting that high
    transition rate branches (which flip direction frequently) are
    poorly predicted by averaging behavior. (Obviously, loop-like
    branches have a high transition rate for one direction.) This is
    a limited type of local history. If I understand correctly, your
    state machine mechanism would capture the behavior of such
highly alternating branches.

Compressing the history into a pattern does mean losing
information (as does a counter), but I had thought such pattern
storage might be more efficient than storing local history. It
is also interesting that the Alpha 21264 local predictor used
dynamic pattern matching rather than a static transformation of
history to prediction (state machine).

    I think longer local history prediction has become unpopular,
    probably because nothing like TAGE was proposed to support
    longer histories but also because the number of branches that
    can be tracked with long histories is smaller.

Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)

    Using a pattern/state machine may also make confidence
    estimation less accurate. TAGE can use confidence of multiple
    matches to form a prediction.

    The use of confidence for making a prediction also makes it
    impractical to store just a prediction nearby (to reduce
    latency) and have the extra state more physically distant. For
    per-address predictors, one could in theory use Icache
    replacement to constrain predictor size, where an Icache miss
    loads local predictions from an L2 local predictor. I think AMD
    had limited local predictors associated with the Icache and had
    previously stored some prediction information in L2 cache using
    the fact that code is not modified (so parity could be used and
    not ECC).

    Daniel A. Jiménez did some research on using neural methods
    (e.g., perceptrons) for branch prediction. The principle was
    that traditional global history tables had exponential scaling
    with history size (2 to the N table entries for N history bits)
    while per-address perceptrons would scale linearly (for a fixed
    number of branch addresses). TAGE (with its folding of long
    histories and variable history) seems to have removed this as a
    distinct benefit. General correlation with specific branches may
    also be less predictive than correlation with path.
    Nevertheless, the research was interesting and larger histories
    did provide more accurate predictions.
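A minimal perceptron predictor in that style looks roughly like this (table size, history length, threshold, and update rule are simplified from Jiménez & Lin's formulation; all the constants are illustrative):

```c
#include <stdint.h>

#define HIST  16
#define TABLE 256

static int8_t weights[TABLE][HIST + 1];  /* per-branch weights, [0] = bias */
static int    ghist[HIST];               /* global history as +1/-1 (0 = untrained) */

/* Dot product of this branch's weights with the global history. */
static int perceptron_y(uint32_t pc)
{
    const int8_t *w = weights[(pc >> 2) % TABLE];
    int y = w[0];
    for (int i = 0; i < HIST; i++)
        y += w[i + 1] * ghist[i];
    return y;
}

static int perceptron_predict(uint32_t pc)
{
    return perceptron_y(pc) >= 0;        /* taken iff dot product non-negative */
}

static void perceptron_update(uint32_t pc, int taken)
{
    const int THETA = 2 * HIST;          /* training threshold (illustrative) */
    int t = taken ? 1 : -1;
    int y = perceptron_y(pc);
    int8_t *w = weights[(pc >> 2) % TABLE];

    /* Train on mispredict or while confidence is below the threshold. */
    if ((y >= 0) != taken || (y < THETA && y > -THETA)) {
        int b = w[0] + t;
        if (b >= -128 && b <= 127) w[0] = (int8_t)b;
        for (int i = 0; i < HIST; i++) {
            int nw = w[i + 1] + t * ghist[i];
            if (nw >= -128 && nw <= 127) w[i + 1] = (int8_t)nw;
        }
    }
    for (int i = HIST - 1; i > 0; i--)   /* shift history, newest at [0] */
        ghist[i] = ghist[i - 1];
    ghist[0] = t;
}

static void perceptron_train(uint32_t pc, int taken, int n)
{
    for (int i = 0; i < n; i++)
        perceptron_update(pc, taken);
}
```

The linear-in-history storage is visible here: each extra history bit costs one weight per table entry rather than doubling the table, which was the original selling point versus 2^N-entry global-history tables.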

    Anyway, I agree that thinking about these things can be fun.





Where, the idea was that the state machine is updated with the
current state and branch direction, giving the next state and
next predicted branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit
    saturating counters, but it is sort of a partial mystery why
(for mainstream processors) more complex lookup schemes and 2-bit
state were preferable to a simpler lookup scheme and 5-bit
    state.

    Well, apart from the relative "dark arts" needed to cram 4-bit
patterns into a 5-bit FSM (it is a bit easier if limiting the
    patterns to 3 bits).



Then again, I had noted before that LLMs are seemingly also
not really able to figure out how to make a 5-bit FSM to model a
full set of 4-bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a
    problem for someone that is "actually smart"; so presumably chip
    designers could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage
    would cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative
predictive weakness of 2-bit state) would have cost more than
the cost of smaller tables of 6-bit state?...

Say, for example, 2b (state_dir => newstate_prediction):
 00_0 => 10_0  //Weakly not-taken, dir=0, goes strongly not-taken
 00_1 => 01_1  //Weakly not-taken, dir=1, goes weakly taken
 01_0 => 00_0  //Weakly taken, dir=0, goes weakly not-taken
 01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
 10_0 => 10_0  //Strongly not-taken, dir=0
 10_1 => 00_0  //Strongly not-taken, dir=1 (goes weak)
 11_0 => 01_1  //Strongly taken, dir=0 (goes weak)
 11_1 => 11_1  //Strongly taken, dir=1
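For reference, the table above is the standard 2-bit saturating counter; in C, with the more common monotonic 0..3 encoding (0 = strongly not-taken, 3 = strongly taken; behaviorally equivalent to the table, just a different state numbering), it is a handful of lines:

```c
/* 2-bit saturating counter, monotonic encoding: 0..3,
 * 0 = strongly not-taken ... 3 = strongly taken. */
static int ctr_update(int state, int taken)
{
    if (taken)
        return state < 3 ? state + 1 : 3;  /* saturate at strongly taken */
    return state > 0 ? state - 1 : 0;      /* saturate at strongly not-taken */
}

static int ctr_predict(int state)
{
    return state >= 2;   /* prediction is the high bit of the counter */
}
```

The hysteresis shows up in a loop-closing branch: a strongly-taken counter (3) mispredicts the loop exit once, drops to 2, and still predicts taken on the next loop entry.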

Can expand it to 3 bits, for 2-bit patterns
  As above, and 4 more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not taken
      001   weak, taken
      010   strong, not taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
    The alternating states just flip-flop between taken and not
    taken.
      The weak states can move between any of the 4.
      The strong states are used if the pattern is reinforced.

    Going up to 3-bit patterns is more of the same (add another
    bit, doubling the number of states). Something seems to break
    down when getting to 4-bit patterns though (one can't fit both
    weak and strong states for the longer patterns, so the 4-bit
    patterns effectively exist only as weak states which partly
    overlap the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within
    a 6 bit state might be impossible. Although there would be
    sufficient state-space for the looping 5-bit patterns, there may
    not be sufficient state-space to distinguish whether to move
    from a mismatched 4-bit pattern to a 3 or 5 bit pattern.
    Whereas, at least with 4-bit, any mismatch of the 4-bit pattern
    can always decay to a 3-bit pattern, etc. One needs to be able
    to express decay both to shorter patterns and to longer
    patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern
    extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle
    or something...

    Then again, maybe other people would not find any particular
    difficulty in these sorts of tasks.


    Terje



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 16 17:24:14 2026
    From Newsgroup: comp.arch

    On 11/18/25 10:16 AM, Anton Ertl wrote:
    [snip]
    How can register renaming be implemented on SPARC? As discussed
    above, this can be done independently: Have 96 architectural registers
    (plus the window pointer), and make 8 of them global registers, and
    the rest 24 visible registers plus 4 windows of 16 registers, with the
    usual switching. And then rename these 96 architectural registers.

    Rather than having all the architectural registers be "visible"
    in a single RAT structure, one could use a technique similar to
    RAT checkpointing (used for reproducing state after a branch
    misprediction; MIPS R10000 had a 4 entry "branch stack" that
    included mapping table copies). The window shifts would adjust
    which block specific RAT subtables provide mappings for and
    blocks shifted out would behave like RAT checkpoints.

    A variant in the opposite direction would be to treat only the 32
    visible registers as architectural registers, avoiding large RAT
    entries. The save instruction would emit store microinstructions for
    the local and in registers, and then the renamer would rename the out registers to the in registers, and would assign 0 to the local and out registers (which would not occupy a physical register at first). This approach makes the most sense with a separate renamer as is now
    common. The restore instruction would rename the in registers to the
    out registers, and emit load microinstructions for the local and the
    in registers.
    Another possibility would be to use special register storage for
    checkpoint values which only supports transfers to and from a
    multiported entry. With the high wire overhead of multiported
    register files, such storage can be substantially hidden under
    the wires. (For in-order designs, Sun had the (academic-only?)
    concept of 3D register files that used select lines to choose
    which set of registers was being used. Aside from register
    windows for an in-order core, this could be used for fine-
    grained but not simultaneous multithreading.)

    (I have not found the paper that suggested such checkpoint
    registers — the intent as I recall was to reduce the storage
    overhead for values which were likely dead but still within the
    speculative window. The 3D register file paper has "Three-
    Dimensional" in the title so is easy to find among my files: "A
    Three Dimensional Register File For Superscalar Processors",
    Marc Tremblay et al., 1995. I wish I had kept better notes about
    the papers I have looked at.)


    OoO tends to work fine with storing around calls and loading around
    returns in architectures without register windows, because the storing
    mostly consumes resources, but is not on the critical path, and
    likewise for the loading (the loads tend to be ready earlier than the instructions on the critical path); and store-to-load forwarding
    deals with the problem of a return shortly after a call.

    I think lazy saving might be preferred over treating such as
    ordinary stores. Also storing to a dedicated single-ported wide
    and multibanked storage might be desirable. In some sense, a
    register window is a single 256-byte register when referenced by
    SAVE and RESTORE instructions (if I remember/understand
    correctly), so tracking loads and stores by individual 8-byte
    chunks seems less efficient. In addition such would not require
    full address tagging. I have no clue whether it would be
    worthwhile to specialize these cases for efficiency rather than
    use more general resources (with higher utilization).

    I would guess that for individual store-data operations, the
    operation would also include information to free the register so
    that it is not freed before the data is read but is potentially
    freed before the data reaches the longer-term storage.

    Since wide accesses are cheaper than multiple independent
    accesses, there might be a preference to batch the stores. On
    the other hand, if using the data cache for storage, separate
    stores could be done opportunistically to exploit unused banks.

    Given the predictability of demand for saved values, one might
    be able to move the storage farther away, exploiting the
    predictability (and pipeline latency) to hide the latency of
    accessing the storage.

    Even though high-performance SPARC is effectively dead, it is
    interesting to think about such things.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 16 18:04:27 2026
    From Newsgroup: comp.arch

    On 12/2/25 2:55 PM, MitchAlsup wrote:

    Robert Finch <[email protected]> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for
    protection or translation of the address.

    You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

    Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.

    While that might be true of all practical designs, I _think_
    that with a virtually tagged cache even crossing a page boundary
    might not be a problem.

    (I also discovered a way of using overlaid skewed associativity
    to support variable alignment of cache blocks, so one could have
    64-byte cache blocks aligned at 32-byte boundaries. Within a
    page (or contiguous physical address chunk) physical tagging
    could be used for such "misaligned" blocks. This was intended
    primarily for instruction caches where contiguous useful chunks
    are more common, wide access is common, and the chunks may not
    be within an ordinary aligned block. With such cache blocks,
    crossing a block size alignment boundary would not require a
    second tag check. Of course, that is a special case and assumes
    a design that is unlikely to be used.)

    A cache that recorded the way of the next cache block might be
    able to avoid both TLB and tag checks. (One of the Alpha Icache
    designs had a next block _predictor_ so such a next adjacent
    block designator might not be so crazy that the proposer would
    be sent to an asylum.)

    With a direct-mapped cache, accessing two adjacent blocks in
    parallel would be somewhat easier since one would not need way
    prediction or determination to do the access.

    I would also think that if two cache tags could be checked in
    parallel (even if constrained to "paired" tags) one could do the
    same with a TLB (a clustered TLB has multiple adjacent pages so
    the constraint would be more about crossing a page-cluster
    boundary, I think).

    I *am* skeptical that supporting page-crossing (or even block-
    crossing) accesses is important enough to justify a lot of
    complexity and extra hardware, but it seems that such would not
    be impossible even without general two-entry TLB probes.

    If I understand correctly, for the data access itself, there is
    no significant difference between unaligned at the SRAM array
    level and unaligned at the cache block level. I.e., the issue is
    the tags.

    If one allows multiple accesses to exploit the bank/bandwidth
    support of the cache provided for unaligned accesses, supporting
    block-crossing and page crossing loads may not be unreasonable
    (it seems to me). If one supports three loads per cycle, one may
    already support three TLB look-ups; if two of the loads are in
    the same page, then the third load could use two TLB-access
    slots. For streaming accesses, the TLB entry might be cached
    separately, so an end-of-page unaligned access might use the
    separately cached TLB entry and only make one access to the
    general TLB.

    I recognize that there is a huge difference between physically
    possible and practical or worthwhile, but 'never' and 'always'
    seem to be trigger words for me, "forcing" me to find an
    exception.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 16 20:05:50 2026
    From Newsgroup: comp.arch

    On 2/6/26 10:54 AM, John Dallman wrote:
    In article <10m12ue$2t2k5$[email protected]>, [email protected] (Paul Clayton) wrote:

    Currently, assembly-level compatibility does not seem worthwhile.

    Not now, no. There was one case where it was valuable: the assembler
    source translator for 8080 to 8086. That plus the resemblance of early
    MS-DOS to CP/M meant that CP/M software written in assembler could be got working on the early IBM PC and compatibles more rapidly than new
    software could be developed in high-level languages. That was one of the factors in the runaway success of PC-compatible machines in the early
    1980s.

    I remember reading about the 8080 → 8086 assembly translator. I
    did not know that CP/M and MS-DOS were similar enough to
    facilitate porting, so that note was interesting to me.

    Software is usually distributed as machine code binaries not as
    assembly,

    Or as source code...

    I think source distribution, even of free/open-source (or
    shrouded source or other source-available) software, was not the
    common case. I am lazy enough to use Fedora Linux.

    Would the Intel-64 have been assembly compatible with AMD64? I
    would have guessed that not just encodings would have been
    different. If one wants to maintain market friction, supporting
    the same assembly seems counterproductive.

    It would hardly have mattered. Very little assembler is written for
    64-bit architectures.

    Quite true, of course.

    Cooperating with AMD to develop a more sane encoding while
    supporting low overhead for old binaries would have been better
    for customers (I think).

    Intel didn't admit to themselves they needed to do 64-bit x86 until AMD64
    was thrashing them in the market. Far too late for collaborative design
    by then.

    Yes, seeking peace the morning before the Little Big Horn would
    not have been an effective strategy. Intel presumably thought
    Itanium would be the only merchant 64-bit ISA that mattered (and
    this would exclude AMD) and that the masses could use 32-bit
    until less expensive Itanium processors were possible.

    It is not clear if assembly programmers would use less efficient
    abstractions (like locks) to handle concurrency in which case
    a different memory model might not impact correctness.

    You are thinking of doing application programming in assembler. That's
    pretty much extinct these days. Use of assembler to implement locks or
    other concurrency-control mechanisms in an OS or a language run-time
    library is far more likely.

    I've been doing low-level parts of application development for over 40
    years. In 1983-86, I was working in assembler, or needed to have a very
    close awareness of the assembler code being generated by a higher level language. In 1987-1990, I needed to be able to call assembler-level OS functions from C code. Since then, the only coding I've done in assembler
    has been to generate hardware error conditions for testing error handlers. I've read and debugged lots of compiler-generated assembler to report compiler bugs, but that has become far less common over time.

    I did think that assembly was mostly a niche skill nowadays.
    Your little history is interesting.

    I do not think ISA heterogeneity is necessarily problematic.

    It requires the OS scheduler to be ISA-aware, and to never, /ever/ put a thread onto a core that can't run the relevant ISA. That will inevitably
    make the scheduler more complicated and thus increase system overheads.

    I think it depends on how much heterogeneity there is. If an
    unimplemented instruction exception is generated, a thread could
    be migrated to a core supporting that instruction. The OS might
    even then mark that application as requiring that feature. Of
    course, this also prevents using a less feature-ful core for
    programs that only occasionally use a feature.

    I agree that such would add complexity, but there is already
    complexity for power saving with same ISA heterogeneity. NUMA-
    awareness, cache sharing, and cache warmth also complicate
    scheduling, so the question becomes how much extra complexity
    does such introduce.

    For SIMD width — apart from the broken, in my opinion, memory
    copy implementations — a lot of programs probably do not use
    SIMD. (That may be changing slowly as single thread performance
    improvement has slowed. Maybe?) While I agree that SIMD width
    was not a good basis for the heterogeneity, it was the feature
    that differed for the Intel chip.

    I suspect it might require more system-level organization (similar
    to Apple).

    Have you ever tried to optimise multi-threaded performance on a modern
    Apple system with a mixture of Performance and Efficiency cores? I have,
    and it's a lot harder than Apple give the impression it will be.

    Apple make an assumption: that you will use their "Grand Central Dispatch" threading model. That requires multi-threaded code to be structured as a one-direction pipeline of work packets, with buffers between them, and
    one thread/core per pipeline stage. That's a sensible model for some
    kinds of work, but not all kinds. It also requires compiler extensions
    which don't exist on other compilers. So you have to fall back to POSIX threads to get flexibility and portability.

    Interesting.

    If you're using POSIX threads, the scheduler seems to assign threads to
    cores randomly. So your worker threads spend a lot of time on Efficiency cores. Those are in different clusters from the Performance cores, which means that communications between threads (via locks) are very slow.
    Using Apple's performance category attributes for threads has no obvious effect on this.

    While there would be some inevitable slowness in communication
    between clusters (and power domains), I wonder if this was a
    chosen simplification in Apple's case. It makes sense to not
    bother optimizing a use case one does not expect to happen.

    The way to fix this is to find out how many Performance cores there are
    in a Performance cluster (which wasn't possible until macOS 12) and use
    that many threads. Then you need to reach below the POSIX threading layer
    to the underlying BSD thread layer. There, you can set an association
    number on your threads, which tells the scheduler to try to run them in
    the same cluster. Then you get stable and near-optimal performance. But finding out how to do this is fairly hard, and few seem to have managed it.

    Also interesting.

    Even without ISA heterogeneity, optimal scheduling
    seems to be a hard problem. Energy/power and delay/performance
    preferences are not typically expressed. The abstraction of each
    program owning the machine seems to discourage nice behavior (pun
    intended).

    Allowing processes to find out the details of other processes' resource
    usage makes life very complicated, and introduces new opportunities for security bugs.

    I still feel an attraction to a market-oriented resource
    management such that threads could both minimize resource use
    (that might be more beneficial to others) and get more than a
    fair-share of resources that are important.

    (I thought Intel marketed their initial 512-bit SIMD processors
    as GPGPUs with x86 compatibility, so the idea of having a
    general purpose ISA morphed into a GPU-like ISA had some
    fascination after Cell.)

    Larrabee turned out to be a pretty bad GPU, and a pretty bad set of CPUs.

    Yeah. I was commenting more on the spirit of the age or the
    seductiveness of some ideas.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Wed Feb 18 01:04:10 2026
    From Newsgroup: comp.arch

    Used a new acronym today - HHI, standing for hardware-based hardware
    interrupt. An interrupt that runs a hardware process instead of an
    interrupt subroutine.

    The Qupls co-processor has an interrupt tied to the graphics command
    queue being non-empty. It then triggers graphics operations instead of
    running an ISR.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Feb 18 14:45:36 2026
    From Newsgroup: comp.arch

    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for
    the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...


    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes with the number of
    history bits used varying for different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used as well as a useful bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex patterns did
    help.



    But, then always at the end of it using 2-bit saturating counters:
       weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
       Keep a local history of taken/not-taken;
       XOR this with the low-order-bits of PC for the table index;
       Use a 5/6-bit finite-state-machine or similar.
         Can model repeating patterns up to ~ 4 bits.

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor. The Alpha 21264 had a modest-
    sized (by modern standards) local history predictor with a 1024-
    entry table of ten history bits indexed by the Program Counter
    and the ten bits were used to index a table of three-bit
    counters. This was combined with a 12-bit global history
    predictor with 2-bit counters (note: not gshare, i.e., xored
    with the instruction address) and the same index was used for
    the chooser.


    OK, seems I wrote it wrong:
    I was using a global branch history.

    But, either way, the history is XOR'ed with the relevant bits from PC to generate the index.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.


    As noted, the 5/6-bit FSM can predict arbitrary 4-bit patterns.

    With the PC XOR Hist lookup, there were still quite a few patterns, and
    not just contiguous 0s or 1s that the saturating counter would predict,
    nor just all 0s, all 1s, or ...101010... that the 3-bit FSM could deal with.


    But, a bigger history could mean fewer patterns and more contiguous bits
    in the state.



    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.


    In my case, I was using 6-bit lookup mostly to fit into LUT6-based LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so as
    BRAMs don't support narrow lookups, so the next size up would likely be
    2048x but then inefficiently using the BRAM (there isn't likely a
    particularly good way to make use of 512x 18-bits).


    Though, more efficient case here would likely be to use LUT5's (5-bit
    index), or maybe MUXing LUT5's (for 128x 6b, maybe 512x 6b with 3-levels
    of LUTs).


    I guess, it is an open question: if one did
    reg[4:0] predArray[511:0];
    would the synthesis figure out the optimal pattern on its own,
    vs needing to get more manual, with an array more like:
    reg[95:0] predArray[31:0]; //fit to 5b index / 3b data.

    With it padded up to 6-bits per entry to allow fitting it to the LUTRAM
    1R1W pattern (densely packing to 80 bits would not fit the pattern).


    But, yeah, if the size of the lookup were extended, it could make sense
    to start hashing part of the history, say, maybe:

    hhist[ 5: 0] = hist[5:0];
    hhist[ 9: 6] = hist[9:6] ^ hist[13:10];
    hhist[11:10] = hist[11:10] ^ hist[13:12] ^ hist[15:14];

    Then, say:
    index[11:0] = hhist[11:0] ^ { pc[12:2], pc[13] ^ pc[1] };

    ...


    (I am a little surprised that I have not read a suggestion to
    use the alternate 2-bit encoding that retains the last branch
    direction. This history might be useful for global history; the
    next most recent direction (i.e., not the predicted direction)
    of previous recent branches for a given global history might be
    useful in indexing a global history predictor. This 2-bit
    encoding seems to give slightly worse predictions than a
    saturating counter but the benefit of "localized" global history
    might compensate for this.)

    Alloyed prediction ("Alloyed Branch History: Combining Global
    and Local Branch History for Robust Performance", Zhijian Lu et
    al., 2002) used a tiny amount of local history to index a
    (mostly) global history predictor, hiding (much of) the latency
    of looking up the local history by retrieving multiple entries
    from the table and selecting the appropriate one with the local
    history.

    There was also a proposal ("Branch Transition Rate: A New Metric
    for Improved Branch Classification Analysis", Michael Haungs et
    al., 2000) to consider transition rate, noting that high
    transition rate branches (which flip direction frequently) are
    poorly predicted by averaging behavior. (Obviously, loop-like
    branches have a high transition rate for one direction.) This is
    a limited type of local history. If I understand correctly, your
    state machine mechanism would capture the behavior of such
    highly alternating branches.

    Compressing the history into a pattern does mean losing
    information (as does a counter), but I had thought such pattern
    storage might be more efficient than storing local history. It
    is also interesting that the Alpha 21264 local predictor used
    dynamic pattern matching rather than a static transformation of
    history to prediction (state machine).

    I think longer local history prediction has become unpopular,
    probably because nothing like TAGE was proposed to support
    longer histories but also because the number of branches that
    can be tracked with long histories is smaller.

    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)


    ...

    It seems what I wrote originally was inaccurate: I don't store a history per target; it was merely recent taken/not-taken branches.

    But, I no longer remember what I was thinking at the time, or why I had written local history rather than global history (unless I meant "local"
    in terms of recency or something, I don't know).



    Using a pattern/state machine may also make confidence
    estimation less accurate. TAGE can use confidence of multiple
    matches to form a prediction.

    The use of confidence for making a prediction also makes it
    impractical to store just a prediction nearby (to reduce
    latency) and have the extra state more physically distant. For
    per-address predictors, one could in theory use Icache
    replacement to constrain predictor size, where an Icache miss
    loads local predictions from an L2 local predictor. I think AMD
    had limited local predictors associated with the Icache and had
    previously stored some prediction information in L2 cache using
    the fact that code is not modified (so parity could be used and
    not ECC).

    Daniel A. Jiménez did some research on using neural methods
    (e.g., perceptrons) for branch prediction. The principle was
    that traditional global history tables had exponential scaling
    with history size (2 to the N table entries for N history bits)
    while per-address perceptrons would scale linearly (for a fixed
    number of branch addresses). TAGE (with its folding of long
    histories and variable history) seems to have removed this as a
    distinct benefit. General correlation with specific branches may
    also be less predictive than correlation with path.
    Nevertheless, the research was interesting and larger histories
    did provide more accurate predictions.

    Anyway, I agree that thinking about these things can be fun.


    A lot of this seems far more complex, though, than what would be
    practical on a Spartan- or Artix-class FPGA.


    I was mostly using 5/6 bit state machines as they gave better results
    than 2-bit saturating counters, and fit nicely within the constraints of
    a "history XOR PC" lookup pattern.


    Also initially, a while ago, it seemed that the patterns for these state machines were outside the reach of LLM AIs, but it seems the AIs are
    quickly catching up at these sorts of mental puzzles.

    Granted, I am mostly within the limits of cheap/free.


    But, OTOH, paying a steep per-month fee so that "vibe coding" could take
    my hobby from me isn't particularly compelling personally. But, even as
    such, it does seem like if people can do so, my "existential merit" is weakening.

    Even if as-is, apparently the people who try to do non-trivial projects
    via "vibe coding" generally end up with some big mess of poorly written
    code that falls on its face once the code gets big enough that the LLMs
    can no longer reason about it.







    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating
    counters, but it is sort of a partial mystery why (for mainstream
    processors) more complex lookup schemes and 2-bit state were
    preferable to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit
    patterns into a 5-bit FSM (it is a bit easier when limiting the
    patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not
    really able to figure out how to make a 5 bit FSM to model a full set
    of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers
    could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits; but then, presumably, needing significantly
    larger tables (to compensate for the relative predictive weakness of
    2-bit state) would have cost more than smaller tables of 6-bit
    state?...

    Say, for example, 2b:
      00_0 => 10_0  //weakly not-taken, dir=0, goes strongly not-taken
      00_1 => 01_1  //weakly not-taken, dir=1, goes weakly taken
      01_0 => 00_0  //weakly taken, dir=0, goes weakly not-taken
      01_1 => 11_1  //weakly taken, dir=1, goes strongly taken
      10_0 => 10_0  //strongly not-taken, dir=0 (stays strong)
      10_1 => 00_0  //strongly not-taken, dir=1 (goes weak)
      11_0 => 01_1  //strongly taken, dir=0 (goes weak)
      11_1 => 11_1  //strongly taken, dir=1 (stays strong)

    Can expand it to 3-bits, for 2-bit patterns
       As above, and 4-more alternating states
       And slightly different transition logic.
    Say (abbreviated):
       000   weak, not taken
       001   weak, taken
       010   strong, not taken
       011   strong, taken
       100   weak, alternating, not-taken
       101   weak, alternating, taken
       110   strong, alternating, not-taken
       111   strong, alternating, taken
    The alternating states just flip-flop between taken and not taken.
       The weak states can move between any of the 4.
       The strong states are used if the pattern is reinforced.

    Going up to 3-bit patterns is more of the same (add another bit,
    doubling the number of states). Something seems to break down when
    getting to 4-bit patterns though (one can't fit both weak and strong
    states for the longer patterns, so the 4-bit patterns effectively
    exist only as weak states which partly overlap the weak states for
    the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6
    bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be
    sufficient state-space to distinguish whether to move from a
    mismatched 4-bit pattern to a 3 or 5 bit pattern. Whereas, at least
    with 4-bit, any mismatch of the 4-bit pattern can always decay to a
    3-bit pattern, etc. One needs to be able to express decay both to
    shorter patterns and to longer patterns, and I suspect at this point,
    the pattern breaks down (but can't easily confirm; it is either this
    or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or
    something...

    Then again, maybe other people would not find any particular
    difficulty in these sorts of tasks.


    Terje




    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Feb 18 15:51:04 2026
    From Newsgroup: comp.arch

    On 2/9/2026 8:18 PM, Paul Clayton wrote:
    On 2/9/26 2:33 PM, MitchAlsup wrote:

    Paul Clayton <[email protected]> posted:

    On 2/5/26 2:02 PM, MitchAlsup wrote:

    [snip]

    MS wants you to buy Office every time you buy a new PC.

    I thought MS wanted everyone to use Office365. It is harder to
    force people to get a new computer, but a monthly fee will recur
    automatically.

    When I need a tool--I buy that tool--I never rent that tool.

    Name one feature I would want from office365 that was not already
    present in office from <say> 1998.

    I do not know if MS can legally cancel your MS Office license, and I
    doubt the few "software pirates" who continue to use an unsupported ("invalid") version would be worth MS' time and effort to prevent such people from using such software.

    However, there seems to be a strong trend toward "you shall own nothing."

    MS, then moves all the menu items to different pull downs and
    makes it difficult to adjust to the new SW--and then it has the
    gall to chew up valuable screen space with ever larger
    pull-down bars.

    Ah, but they are just beginning to include advertising. Imagine
    every time one uses the mouse (to indicate to the computer that
    the user's eyes are focused on a particular place) an
    advertisement appears and follows the cursor movement. Even just
    having menu entries that are advertisements would be kind of
    annoying, but one would be able to get rid of those by leasing
    the premium edition (until one needs to lease the platinum
    edition, then the "who wants to remain a millionaire" edition).

    Why would I or anyone want advertising in office ????????

    Why would anyone want advertising in a Windows Start Menu?

    For Microsoft such provides a bit more revenue/profit as businesses seem willing to pay for such advertisements. Have you ever heard "You are not
    the consumer; you are the product"?

    I think I read that some streaming services have added
    advertising to their (formerly) no-advertising subscriptions, so
    the suggested lease term inflation is not completely
    unthinkable.


    Better at this point to just use LibreOffice or similar...

    Well, and to not use Windows 11 ...


    For now, my main PC still uses Windows 10, and at this point would
    almost rather jump ship to Linux if need be, than go to Windows 11.

    Like, MS has become hell bent on turning Windows 11 into a trash fire.


    Is it any wonder users want the 1937 milling machine model ???

    Have no fear; soon you may be merely leasing your computer.
    Computers need to have the latest spyware so that advertisements
    can be appropriately targeted and adblocking must be made
    impossible.

    I am the kind of guy that turns off "telemetry" and places advertisers
    in /hosts file.
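The /hosts trick mentioned above works by pointing ad/telemetry hostnames at a null address, so lookups never leave the machine. A sketch with placeholder domains (the example.com/example.net names are illustrative, not a real blocklist):

```
# /etc/hosts (or %SystemRoot%\System32\drivers\etc\hosts on Windows)
0.0.0.0   ads.example.com
0.0.0.0   telemetry.example.net
```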

    If all new computers are "leased" (where tampering with the
    device — or not connecting it to the Internet such that it can
    phone home — revokes "ownership" and not merely warranty and one
    agrees to a minimum use [to ensure that enough ads are viewed]),
    ordinary users (who cannot assemble devices from commodity
    parts) would not have a choice. If governments enforce the
    rights of corporations to protect their businesses by outlawing
    sale of computer components to anyone who would work around the
    cartel, owning a computer could become illegal. Governments have
    an interest in having all domestic computers be both secure and
    to facilitate domestic surveillance, so mandating features that
    remove freedom and require an upgrade cycle (which is also good
    for the economy☺) has some attraction.

    I doubt people like you are a sufficient threat to profits that
    such extreme measures will be used, but the world (and
    particularly the U.S.) seems to be becoming somewhat dystopian.

    This is getting kind of off-topic and is certainly not something I want
    to think about.

    Ironically, this sort of thing, and also locking down computers enough
    that they only allow basic user programs and disallow "side loading"
    etc, was an element in some of my sci-fi stories.

    But, not exactly an optimistic point.

    And, there was effectively an illicit black market for unconstrained
    computers and computer parts (mostly salvaged). Where, a computer built
    mostly from salvaged parts from old electronics would be worth more than
    a new computer available though official channels (and within legal
    limits in terms of OS and hardware specs).


    Though, in such a world, owning an unconstrained computer would be seen
    as both illegal and dangerous.

    But, not all sci-fi needs to be utopic or optimistic (this itself seems
    like a trap, both that people assume this as a default, or that people
    can mistake overt dystopias as an ideal to strive for).

    But, then if one includes "obvious bad" things (like a bunch of WWII
    type stuff; mass euthanasia and so on), then almost invariably someone
    thinks that one is endorsing it and gets offended about it.

    Well, and sometimes one needs to be able to reference and depict bad
    things in order to denounce them as such.

    Granted, things are not so great when one is more prone to dealing with
    "gray on gray morality" rather than a more clear cut "battle of good
    versus evil" theme. Reality more tends towards the former, but society
    prefers the latter, and choosing the latter more often leads one into a
    trap (more so if one assumes that the protagonists' side is always
    necessarily the "good" one).

    Say, flip the perspective, and tell a story from the POV of the villain,
    and almost invariably they become an anti-hero even if they continue
    pretty much the exact same actions they would have taken if a villain
    from the hero's POV (more so if they are humanized in any way, or their actions are given any sort of justification, even if said justification
    is purely self-serving and egocentric).

    Alas...


    ...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Thu Feb 19 08:02:00 2026
    From Newsgroup: comp.arch

    In article <10n2u02$270jc$[email protected]>, [email protected] (Paul Clayton) wrote:

    I remember reading about the 8080-to-8086 assembly translator. I
    did not know that CP/M and MS-DOS were similar enough to
    facilitate porting, so that note was interesting to me.

    /Early/ MS-DOS. That used CP/M-like File Control Blocks, and didn't have
    hierarchical directories. It didn't really support hard disks. The
    CP/M-style APIs all carried on existing after MS-DOS 2.0 introduced a new
    set of APIs that were more suitable for high-level languages, but they
    weren't much used in new software.

    Intel presumably thought Itanium would be the only merchant
    64-bit ISA that mattered (and this would exclude AMD) and
    that the masses could use 32-bit until less expensive Itanium
    processors were possible.

    Pretty much. Then the struggle to make Itanium run fast became the
    overpowering concern, until they gave up and concentrated on x86-64,
    claiming that Itanium would be back in a few years.

    I don't think many people took that claim seriously. Some years later, an
    Intel marketing man was quite shocked to hear that, and that the world
    had simply been humouring them.

    I agree that such would add complexity, but there is already
    complexity for power saving with same ISA heterogeneity. NUMA-
    awareness, cache sharing, and cache warmth also complicate
    scheduling, so the question becomes how much extra complexity
    does such introduce.

    If the behaviour of Apple's OSes are any guide, complexity is avoided as
    far as possible.

    I still feel an attraction to a market-oriented resource
    management such that threads could both minimize resource use
    (that might be more beneficial to others) and get more than a
    fair-share of resources that are important.

    The difficulty there is that developers will have a very hard time
    creating /measurable/ speed-ups that apply across a wide range of
    different configurations. Companies will therefore be reluctant to put developer hours into it that could go into features that customers are
    asking for.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Feb 19 05:53:20 2026
    From Newsgroup: comp.arch

    On 2/19/2026 2:02 AM, John Dallman wrote:
    In article <10n2u02$270jc$[email protected]>, [email protected] (Paul Clayton) wrote:

    I remember reading about the 8080-to-8086 assembly translator. I
    did not know that CP/M and MS-DOS were similar enough to
    facilitate porting, so that note was interesting to me.

    /Early/ MS-DOS. That used CP/M-like File Control Blocks, and didn't have
    hierarchical directories. It didn't really support hard disks. The
    CP/M-style APIs all carried on existing after MS-DOS 2.0 introduced a new
    set of APIs that were more suitable for high-level languages, but they
    weren't much used in new software.


    My own limited experience with MS-DOS programming mostly showed them
    using integer file-handles and a vaguely Unix-like interface for file IO
    at the "int 21h" level.

    Which is, ironically, in conflict with the "FILE *" interface used by
    C's stdio API.

    But, I do remember a mechanism involving shared FCB structs existing on MS-DOS, but AFAIK it was mostly unused in favor of the use of integer
    handles.

    But, it has been a very long time since I have messed around much with
    DOS (well, and in my childhood it was mostly 6.20 and 6.22 and similar,
    or 7.00 if using the version that came with Win95).


    Though, I have a vague memory of sometimes setting up chimeric versions
    of DOS, mostly intentionally installing 7.00 on top of 6.22 because
    there were a lot of additional programs that existed within a 6.22
    install that were absent 7.00.

    IIRC, in 6.22 it had QBASIC and EDIT was a thin wrapper over QBASIC,
    whereas 7.00 had dropped QBASIC and made EDIT self-contained, or
    something to this effect. So, in this sense, it made sense to install
    6.22 first and then 7.00 on top of it to get a DOS install that still
    had things like QBASIC and similar.


    Though, this was before switching over to dual-booting Slackware and
    NT4, and then running Cygwin on NT4 (was my general setup in
    middle-school before switching to Win2K in high-school). By that point,
    had mostly abandoned QBASIC (but, QBASIC was used some in elementary
    school).

    Well, apart from some vague (unconfirmed) memories of being exposed to
    Pascal via the "Macintosh Programmer's Workshop" (MPW) thing at one point
    and being totally lost (was very confused, used a CLI but the CLI commands
    didn't make sense). Memory was like a Macintosh II with an external HDD
    and magneto-optical drive (*). Seemingly, the hardware exists and matches
    my memory, so presumably I saw it, but the memory also doesn't make much
    sense.

    *: Like, it used disks that were sort of like giant versions of the 3.5"
    floppies, holding something resembling a rainbow-colored CD-ROM (disk
    protected by the sliding door). Or, say, if one took a 3.5" floppy and
    scaled it up to be a little larger than a 5.25" floppy (and if it held
    something resembling a rainbow-patterned CD-ROM).

    These things were a novelty as basically none of the other computers
    used them (everything else using normal floppies or CD-ROMs). Like, some
    sort of weird alien tech.


    Like, someone brought it in and had me try to use it (at the elementary school), and I didn't get it (like, it was confusing, and/or I was too
    stupid to use it at the time).

    Well, nothing like this ever happened again; I guess I had kinda blown it
    pretty hard at the time. The person took the computer with them and
    left. I am not sure why they came by (was managed by a guy who wasn't
    one of the usual teachers).


    Pretty much everything else ran DOS, apart from some Apple II/E and an
    Apple II/GS and similar (where the II/GS was kinda like the Mac, but
    with no MPW). The II/E's could also do BASIC, or one could boot up games
    like "Oregon Trail" and similar.

    Or, sorta timeline:
      Early years: Mostly played NES and watched TV.
        Like, mostly Super Mario Bros and similar.
        Well, and TV shows like "Captain N" and "Super Mario Super Show".
        Lots of NES related stuff going on in this era.
      Started elementary school:
        Disturbing experience of being around other people;
        And like, none of them could read or similar (*1);
        Teacher was surprised, went and got librarian, ...
        Started messing around with Apple II/e's;
          Some BASIC on the II/e.
        Encounter with the guy with the Mac II;
        PCs were around, typically with 5.25" floppy drives;
        TV shows included things like the Sonic the Hedgehog cartoons.
          Also ReBoot and similar.
          Well, and "Star Trek: TNG".
      Parents got a PC: Had Win 3.11 and a CD-ROM drive.
        Mostly played games like Wolfenstein 3D and similar.
        QBASIC era started;
      Got my own PC;
        Started writing some stuff in real-mode ASM;
        Started moving from QBASIC to C;
        Windows 95 appeared;
      Moved (~ end of Elementary School era, following 6th grade);
        Tried using Win95, but it sucked.
        Jumped to NT4;
      Started Middle School.
        TV Shows in this era: "Star Trek: Voyager" and "Deep Space Nine".
      High School:
        Jumped to Windows 2000.
        "Star Trek: Enterprise" (but... it sucked...).
      ...

    *1: Was likely unexpected that I would make an issue about no one being
    able to read, but I think at the time, I was not particularly impressed
    by being shown cards with letters and similar, so...

    I don't remember my very early years though (my span of memory seemingly starting in the NES era).


    But, alas, my life since then was basically a failure...

    I guess back then, maybe people didn't realize it yet.

    Well, starting in middle school, stuff sucked a lot more. Just had to
    sit around classes and basically say nothing to draw undue attention,
    which sucked (was nicer when people just let me go off and mess with
    computers and similar). Still never really did much schoolwork though,
    just did tests when they came along (though, this strategy was an epic
    fail for college level calculus classes though... I just sorta figured I
    was too stupid for this stuff...).

    ...


    Intel presumably thought Itanium would be the only merchant
    64-bit ISA that mattered (and this would exclude AMD) and
    that the masses could use 32-bit until less expensive Itanium
    processors were possible.

    Pretty much. Then the struggle to make Itanium run fast became the overpowering concern, until they gave up and concentrated on x86-64,
    claiming that Itanium would be back in a few years.

    I don't think many people took that claim seriously. Some years later, an Intel marketing man was quite shocked to hear that, and that the world
    had simply been humouring them.


    In a way, it showed that they screwed up the design pretty hard that
    x86-64 ended up being the faster and more efficient option...

    I guess one question is if they had any other particular drawbacks other
    than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;
    ...


    I guess it is more of an open question of what would have happened, say,
    if Intel had gone for an ISA design more like ARM64 or RISC-V or something.

    These don't seem like they would have been too out-there, but then
    again, at the time there were also some of the WTF design choices that
    existed in MIPS and Alpha, so no guarantee it wouldn't have been screwed up.


    Well, or something like PowerPC, but then again, IBM still had
    difficulty keeping PPC competitive, so dunno. Then again, I think IBM's
    PPC issues were more related to trying to keep up in the chip fab race
    that was still going strong at the time, rather than an ISA design issue.


    I agree that such would add complexity, but there is already
    complexity for power saving with same ISA heterogeneity. NUMA-
    awareness, cache sharing, and cache warmth also complicate
    scheduling, so the question becomes how much extra complexity
    does such introduce.

    If the behaviour of Apple's OSes are any guide, complexity is avoided as
    far as possible.


    Unnecessary complexity is best avoided, as it often comes back to bite
    later.


    I still feel an attraction to a market-oriented resource
    management such that threads could both minimize resource use
    (that might be more beneficial to others) and get more than a
    fair-share of resources that are important.

    The difficulty there is that developers will have a very hard time
    creating /measurable/ speed-ups that apply across a wide range of
    different configurations. Companies will therefore be reluctant to put developer hours into it that could go into features that customers are
    asking for.

    John

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Thu Feb 19 19:59:11 2026
    From Newsgroup: comp.arch

    According to BGB <[email protected]>:
    /Early/ MS-DOS. That used CP/M-like File Control Blocks, and didn't have
    hierarchical directories. ...

    My own limited experience with MS-DOS programming mostly showed them
    using integer file-handles and a vaguely Unix-like interface for file IO
    at the "int 21h" level.

    Yeah, Mark Zbikowski added them along with the tree-structured file system in DOS 2.0.
    He was at Yale when I was, using a Unix 7th edition system I was supporting.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Feb 19 17:04:59 2026
    From Newsgroup: comp.arch

    On 2/19/2026 1:59 PM, John Levine wrote:
    According to BGB <[email protected]>:
    /Early/ MS-DOS. That used CP/M-like File Control Blocks, and didn't have
    hierarchical directories. ...

    My own limited experience with MS-DOS programming mostly showed them
    using integer file-handles and a vaguely Unix-like interface for file IO
    at the "int 21h" level.

    Yeah, Mark Zbikowski added them along with the tree-structured file system in DOS 2.0.
    He was at Yale when I was, using a Unix 7th edition system I was supporting.


    Looks it up...


    Yeah, my case, I didn't exist yet when the MS-DOS 2.x line came out...

    Did exist for the 3.x line though.
    I don't remember much from those years though.


    Some fragmentary memories implied that (in that era) I had mostly been
    watching shows like Care Bears and similar (but, looking at it at a
    later age, found it mostly unwatchable). I think also shows like Smurfs
    and Ninja Turtles and similar, etc.

    Like, at some point, memory breaking down into sort of an amorphous mass
    of things from TV shows all just sort of got mashed together. Not much
    stable memory of things other than fragments of TV shows and such.


    Not sure what the experience is like for most people though.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Thu Feb 19 23:10:00 2026
    From Newsgroup: comp.arch


    My own limited experience with MS-DOS programming mostly showed
    them using integer file-handles and a vaguely Unix-like interface
    for file IO at the "int 21h" level.

    Which is, ironically, in conflict with the "FILE *" interface used
    by C's stdio API.

    However, it's entirely concordant with Unix's lower-level file
    descriptors, as used in the read() and write() calls.

    <https://en.wikipedia.org/wiki/File_descriptor> <https://en.wikipedia.org/wiki/Read_(system_call)>

    The FILE* interface is normally implemented on top of the lower-level
    calls, with a buffer in the process' address space, managed by the C
    run-time library. The file descriptor is normally a member of the FILE structure.

    MS-DOS is not a great design, but it isn't crazy either.

    Well, apart from some vague (unconfirmed) memories of being exposed
    to Pascal via the "Mac Programmer's Workbench" thing at one point
    and being totally lost (was very confused, used a CLI but the CLI
    commands didn't make sense).

    I used it very briefly. It was a very weird CLI, seemingly designed by
    someone opposed to the basic idea of a CLI.

    In a way, it showed that they screwed up the design pretty hard
    that x86-64 ended up being the faster and more efficient option...

    They did. They really did.

    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    Those huge register files had a lot to do with the low code density. They
    had two much bigger problems, though.

    They'd correctly understood that the low speed of affordable dynamic RAM
    as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was to have the compiler schedule loads
    well in advance. They assumed, without evidence, that a compiler with
    plenty of time to think could schedule loads better than hardware doing
    it dynamically. It's an appealing idea, but it's wrong.

    It might be possible to do that effectively in a single-core,
    single-thread, single-task system that isn't taking many (if any)
    interrupts. In a multi-core system, running a complex operating system,
    several multi-threaded applications, and taking frequent interrupts and
    context switches, it is _not possible_. There is no knowledge of any of
    the interrupts, context switches or other applications at compile time,
    so the compiler has no idea what is in cache and what isn't. I don't
    understand why HP and Intel didn't realise this. It took me years, but I
    am no CPU designer.

    Speculative execution addresses that problem quite effectively. We don't
    have a better way, almost thirty years after Itanium design decisions
    were taken. They didn't want to do speculative execution, and they chose
    an instruction format and register set that made adding it later hard. If
    it was ever tried, nothing was released that had it AFAIK.

    The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
    enough ILP to keep those pipelines fed, or they'd just eat cache capacity
    and memory bandwidth executing no-ops ... in a very bulky instruction set.
    They didn't have a general way to extract enough ILP. Nobody does, even
    now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.

    There was also an architectural misfeature with floating-point advance
    loads that could make them disappear entirely if there was a call
    instruction between an advance-load instruction and the corresponding check-load instruction. That cost me a couple of weeks working out and reporting the bug, which was unfixable. The only work-around was to
    re-issue all outstanding floating-point advance-load instructions
    after each call returned. The effective code density went down further,
    and there were lots of extra read instructions issued.

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

    ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998. RISC-V has
    not demonstrated really high performance yet, and it's been around long
    enough that I'm starting to doubt it ever will.

    Well, or something like PowerPC, but then again, IBM still had
    difficulty keeping PPC competitive, so dunno. Then again, I think
    IBM's PPC issues were more related to trying to keep up in the chip
    fab race that was still going strong at the time, rather than an
    ISA design issue.

    I think that was fabs, rather than architecture. While I was providing libraries for PowerPC (strictly, POWER4, POWER5 and POWER6, one after
    another) it always had rather decent performance for its clockspeed and process.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 20 00:06:35 2026
    From Newsgroup: comp.arch


    [email protected] (John Dallman) posted:


    They did. They really did.

    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    Those huge register files had a lot to do with the low code density. They
    had two much bigger problems, though.

    They'd correctly understood that the low speed of affordable dynamic RAM
    as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was to have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
    plenty of time to think could schedule loads better than hardware doing
    it dynamically. It's an appealing idea,

    possibly

    but it's wrong.

    at best we can say that their version failed to provide performance.
    The future may well prove it flat-out wrong at some point.

    It might be possible to do that effectively in a single-core,
    single-thread, single-task system that isn't taking many (if any)
    interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
    the interrupts, context switches or other applications at compile time,
    so the compiler has no idea what is in cache and what isn't. I don't understand why HP and Intel didn't realise this. It took me years, but I
    am no CPU designer.

    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out. Now, 30 years later
    the compilers are still in the position of having made LITTLE progress.

    I suspect a big part of the problem was tension between Intel and HP
    where the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    Speculative execution addresses that problem quite effectively. We don't
    have a better way, almost thirty years after Itanium design decisions
    were taken. They didn't want to do speculative execution, and they close
    an instruction format and register set that made adding it later hard. If
    it was ever tried, nothing was released that had it AFAIK.

    The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
    enough ILP to keep those pipelines fed, or they'd just eat cache capacity
    and memory bandwidth executing no-ops ... in a very bulky instruction set. They didn't have a general way to extract enough ILP. Nobody does,

    Reservation stations* provide such--but they do not use a multiplicity of
    in order pipelines.

    (*) and similar.

    even
    now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.

    There was also an architectural misfeature with floating-point advance
    loads that could make them disappear entirely if there was a call
    instruction between an advance-load instruction and the corresponding check-load instruction. That cost me a couple of weeks working out and reporting the bug, which was unfixable. The only work-around was to
    re-issue all outstanding floating-point advance-load instructions
    after each call returned. The effective code density went down further,
    and there were lots of extra read instructions issued.

    LoL, I guess I am surprised that the same could not happen at interrupt
    or exception....

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

    ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998. RISC-V has
    not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.

    In my humble opinion, there is a lot less wrong with ARM than RISC-V

    Well, or something like PowerPC, but then again, IBM still had
    difficulty keeping PPC competitive, so dunno. Then again, I think
    IBM's PPC issues were more related to trying to keep up in the chip
    fab race that was still going strong at the time, rather than an
    ISA design issue.

    I think that was fabs, rather than architecture.

    I suspect it was the cash-flow the product produced that limited
    "development" ... Whereas x86 and ARM have the kind of cash flow
    that allows/supports whatever the designers can invent that adds
    performance.

    While I was providing libraries for PowerPC (strictly, POWER4, POWER5 and POWER6, one after another) it always had rather decent performance for its clockspeed and process.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Thu Feb 19 22:35:45 2026
    From Newsgroup: comp.arch

    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Now, 30 years later the compilers are still in the position of having
    made LITTLE progress.

    And, to be honest, compiler people had been working on similar problems
    for 30 years already, so most compiler people aren't surprised that 30
    more made no significant difference.

    I suspect a big part of the problem was tension between Intel and HP
    where the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    The odd thing is that these were hardware companies betting on "someone
    else" solving their problem, yet if compiler people truly had managed to
    solve those problems, then other hardware companies could have taken
    advantage just as well.

    So as a commercial strategy it made very little sense.

    To me the main question is whether they were truly confused and just got
    lucky (lucky because they still managed to sell their idea enough that
    most RISC companies folded), or whether they truly understood that the
    actual technical success of the architecture didn't matter and that it
    was just a clever way to kill the RISC architectures.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Fri Feb 20 15:14:08 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/19/2026 1:59 PM, John Levine wrote:
    According to BGB  <[email protected]>:
    /Early/ MS-DOS. That used CPM-like File Control Blocks, and didn't have hierarchical directories. ...

    My own limited experience with MS-DOS programming mostly showed them
    using integer file-handles and a vaguely Unix-like interface for file IO
    at the "int 21h" level.

    Yeah, Mark Zbikowski added them along with the tree-structured file
    system in DOS 2.0.
    He was at Yale when I was, using a Unix 7th edition system I was
    supporting.


    Looks it up...


    Yeah, in my case, I didn't exist yet when the MS-DOS 2.x line came out...

    Did exist for the 3.x line though.
    I don't remember much from those years though.


    Some fragmentary memories imply that (in that era) I had mostly been
    watching shows like Care Bears and similar (but, looking at it at a
    later age, found it mostly unwatchable). I think also shows like Smurfs
    and Ninja Turtles and similar, etc.

    Like, at some point, memory breaks down into sort of an amorphous mass
    where things from TV shows all just sort of got mashed together. Not much
    stable memory of things other than fragments of TV shows and such.


    Not sure what the experience is like for most people though.
    My memory from before the age of 4 is extremely spotty, just a couple of situations that made a lasting impact.
    By the time MSDOS 2.0 came out I had already handed in my MSEE thesis. :-) Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Feb 20 15:29:27 2026
    From Newsgroup: comp.arch

    On 2/19/2026 5:10 PM, John Dallman wrote:
    My own limited experience with MS-DOS programming mostly showed
    them using integer file-handles and a vaguely Unix-like interface
    for file IO at the "int 21h" level.

    Which is, ironically, in conflict with the "FILE *" interface used
    by C's stdio API.

    However, it's entirely concordant with Unix's lower-level file
    descriptors, as used in the read() and write() calls.

    <https://en.wikipedia.org/wiki/File_descriptor> <https://en.wikipedia.org/wiki/Read_(system_call)>

    The FILE* interface is normally implemented on top of the lower-level
    calls, with a buffer in the process' address space, managed by the C
    run-time library. The file descriptor is normally a member of the FILE structure.
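    The layering described above can be sketched in C. This is a hedged,
    minimal illustration, not the actual C runtime source; names like
    MyFile and my_getc are made up for the example:

    ```c
    /* Sketch: a stdio-style FILE wrapping a Unix file descriptor with a
     * userspace buffer.  The fd is a struct member, as in real FILEs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct {
        int  fd;            /* underlying descriptor */
        char buf[512];      /* userspace buffer managed by the library */
        int  pos, len;      /* read cursor and count of valid bytes */
    } MyFile;

    static int my_getc(MyFile *f) {
        if (f->pos >= f->len) {                      /* buffer drained: refill */
            ssize_t n = read(f->fd, f->buf, sizeof f->buf);
            if (n <= 0) return -1;                   /* EOF or error */
            f->len = (int)n;
            f->pos = 0;
        }
        return (unsigned char)f->buf[f->pos++];
    }

    int main(void) {
        const char *path = "/tmp/myfile_demo.txt";
        int wfd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        write(wfd, "hi", 2);
        close(wfd);

        MyFile f = { .fd = open(path, O_RDONLY), .pos = 0, .len = 0 };
        int a = my_getc(&f), b = my_getc(&f);
        printf("%c%c\n", a, b);                      /* prints "hi" */
        close(f.fd);
        unlink(path);
        return 0;
    }
    ```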

    MS-DOS is not a great design, but it isn't crazy either.


    Yeah.


    Well, apart from some vague (unconfirmed) memories of being exposed
    to Pascal via the "Mac Programmer's Workbench" thing at one point
    and being totally lost (was very confused, used a CLI but the CLI
    commands didn't make sense).

    I used it very briefly. It was a very weird CLI, seemingly designed by someone opposed to the basic idea of a CLI.


    My vague memory was that its commands were just sort of straight-up paradoxical. I don't remember much about them now though, other than
    being confused trying to look at them (or getting anything to happen).


    But, yeah, at my level of mental development at the time, the whole
    experience was confusing. Also it using external drives (sorta like on
    the Apple II) but connected up with what looked like printer cables.

    But, I don't really know exactly why the guy with the computer showed
    up, or why he left, but he didn't seem pleased in any case.


    Exact timeline is fuzzy, but I do remember enough familiarity with
    MS-DOS to recognize they were almost completely different.

    And, unlike the Apple II/e, which had essentially used BASIC.


    But, either way, the experience (of MPW weirdness) was not something I
    would have been ready for at that stage of development.

    Well, and apparently a detail I missed in all of this, being that one
    didn't just do a SHIFT+RETURN, but it was apparently necessary to select
    the text for the command one wanted to run (with the mouse) before
    hitting SHIFT+RETURN (or, hitting the keys without selecting something
    first does nothing). Could be related to my difficulties/bewilderment at
    the time (compared with DOS, which was more like "type command and hit ENTER").

    Somehow I didn't remember anything about the "select command first"
    part. More seeing it like "click on the command window and do keyboard shortcuts" and then having it not work.


    But, I guess, some memories of mine, namely the thing of needing to do a ritual of dragging the drive to the trash-can and then also push a
    button on the front of the drive, is reasonably correct for those drives.

    Well, vs the 3.5" drive: Drag to trash, it ejects itself.

    Or, DOS/Windows/etc, wait until drive stops, press button to eject disk.
    Was very important in this case though to drag the drive to the trash
    and then wait for the light on the drive to go off, then press the eject button (and with a good solid press, the disk ejects).

    Also, it using a black-and-white monitor in an era where most others
    around were color (though with a typically lower screen resolution).



    Does seem like a sort of weird almost surreal memory.

    Does imply that my younger self was notable, and not seen as just some otherwise worthless nerd.

    Even if I totally failed at the tasks the guy had wanted from me.

    So, I was confused, and the guy left in frustration.



    In a way, it showed that they screwed up the design pretty hard
    that x86-64 ended up being the faster and more efficient option...

    They did. They really did.


    Yeah.


    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    Those huge register files had a lot to do with the low code density. They
    had two much bigger problems, though.

    They'd correctly understood that the low speed of affordable dynamic RAM,
    as compared to CPUs running at hundreds of MHz, was the biggest barrier to making code run fast. Their solution was to have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
    plenty of time to think could schedule loads better than hardware doing
    it dynamically. It's an appealing idea, but it's wrong.
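    As a C-level sketch of what that static load scheduling amounts to:
    hoisting the next iteration's load ahead of the current iteration's
    work, done here by hand where an EPIC compiler was meant to do it
    automatically. Illustrative only:

    ```c
    /* Software-pipelined sum: the load for iteration i+1 is issued
     * before the add for iteration i, so load latency is overlapped. */
    #include <stdio.h>

    static long sum_pipelined(const long *a, int n) {  /* assumes n >= 1 */
        long s = 0;
        long cur = a[0];                     /* load issued "well in advance" */
        for (int i = 0; i < n; i++) {
            long next = (i + 1 < n) ? a[i + 1] : 0;  /* hoisted next load */
            s += cur;                        /* work overlaps that load */
            cur = next;
        }
        return s;
    }

    int main(void) {
        long a[5] = { 1, 2, 3, 4, 5 };
        printf("%ld\n", sum_pipelined(a, 5));
        return 0;
    }
    ```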


    My CPU core doesn't do speculative prefetch either, but this seems more
    like a "big OoO CPU" feature.

    There is a sort of very limited/naive prefetch, where if it
    guesses that one line of a line pair is likely to be followed by an
    access to the following line in the pair (via heuristics), it will
    prefetch the following line. This can help with things like linear
    memory walks.


    Could be better if there was a good/reliable way to detect linear walks.

    Say, ideal case would be that in linear walk scenarios, most of the
    memory fetches for the walk are via prefetches (while limiting the
    number of hard misses).
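    A toy model of the kind of next-line heuristic being described: if an
    access lands on the line right after the previous access's line,
    predict the following line. The real core's heuristic is not specified
    here, so this is only a sketch:

    ```c
    /* Toy next-line prefetch predictor keyed on consecutive line numbers. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t last_line; } Prefetcher;

    /* returns the line index to prefetch, or 0 for "no prediction" */
    static uint64_t on_access(Prefetcher *p, uint64_t addr, unsigned line_bits) {
        uint64_t line = addr >> line_bits;
        uint64_t pred = (line == p->last_line + 1) ? line + 1 : 0;
        p->last_line = line;
        return pred;
    }

    int main(void) {
        Prefetcher p = { .last_line = ~0ULL };   /* no prior access */
        /* 64-byte lines (line_bits = 6) */
        printf("%llx ",  (unsigned long long)on_access(&p, 0x1000, 6)); /* cold    */
        printf("%llx ",  (unsigned long long)on_access(&p, 0x1040, 6)); /* +1 line */
        printf("%llx\n", (unsigned long long)on_access(&p, 0x9000, 6)); /* jump    */
        return 0;
    }
    ```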

    For the L1 I$, one can assume linear walking by default.

    Though, arguably the effectiveness of a prefetch is reduced in cases
    where the hard-miss is likely to happen before the result of the
    prefetch arrives (even if it is an L2 hit), but does maybe give the L2
    cache a few cycles of "heads up" in the case of an L2 miss.


    In my case, as noted, I ended up using 64 registers, but can note:
    32 is near-optimal for generic code;
    Works well for 32-bit instruction words;
    64 deals better with high-pressure scenarios;
    Is a little tight for 32-bit instruction words;
    128 is likely invariably overkill
    Not particularly viable with 32-bit instruction words.


    Using register-paired types does result in "spikes" in register
    pressure, and is a strong case where supporting 64 registers makes sense
    (eg, so code generation doesn't get "owned"/"pwnt" when dealing with
    int128 or paired-128-bit-SIMD).

    Though, in the case of paired 128-bit ops, the even-registers-only rule
    does have a side benefit of allowing for use of 5-bit register fields
    while accessing all 64 registers (though still leaves a pain point when accessing one of the 64-bit halves of the pair, say if it happens to be
    on the "wrong side" in the case of an ISA like RISC-V).


    For 128 predicate registers, this part doesn't make as much sense:
    Typically, 1 predicate bit is sufficient;
    When exploring schemes for more advanced predication (Eg, 2 or 7/8
    predicate registers), they didn't really even hit break-even.

    Even if going for an IA64 like approach, probably made more sense to
    have gone with an 8-register config, say:
    P0: Hard-wired to 1/True
    P1..P7: Dynamic Predicates


    But, as noted, it was uncommon to find scenarios where having more than
    a single predicate bit offered enough of an advantage over one predicate
    bit to make it worthwhile, so the single-bit scheme seemed to remain the
    most viable (with some more complex scenarios instead using GPRs for
    boolean logic tasks, even if using a GPR for boolean logic tasks is
    arguably wasteful).

    For XG3, had ended up with a scenario where directing Boolean operations
    to X0/RO was understood as updating the predicate bit:
    SLT, SGE, SEQ, SNE: Rd=X0, Sets/Clears SR.T
    AND/OR: Rd=X0, Also modifies SR.T (understood as a Boolean op).
    Contrast (Rd=X0):
    ADD/ADDI: NOP
    LHU/LWU: Reserved for Mode-Hops (XG3 supported) / NOP (unsupported).
    LHU: Jumps to RV64GC Mode (behaves like a JALR with Rd=X0)
    LWU: Jumps to XG3 Mode (behaves like a JALR with Rd=X0)
    Both being fall-through if XG3 is not supported.
    If it doesn't branch, it means only RISC-V ops are supported.


    Currently, the detection features are not used, as they only really make
    sense in a mixed-mode binary that could potentially be used on a plain
    RISC-V target.

    But, in other contexts, the typical pattern is to use pointer tagging,
    where:
    (0)=0: Jump to an address within the same mode, (63:48) ignored
    (1)=1: Jump with possible mode change, (63:48)=mode
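    The encode/decode side of that tagging scheme might be sketched as
    follows. The field layout is as described above; the helper names
    are made up for illustration:

    ```c
    /* Tagged function pointer: bit 0 selects tagged-vs-plain,
     * bits 63:48 carry the mode/tag value. */
    #include <stdint.h>
    #include <stdio.h>

    #define TAG_SHIFT 48
    #define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

    static uint64_t tag_ptr(uint64_t addr, uint16_t mode) {
        /* (0)=1 marks "jump with possible mode change", (63:48)=mode */
        return (addr & ADDR_MASK) | ((uint64_t)mode << TAG_SHIFT) | 1ULL;
    }
    static int      is_tagged(uint64_t p) { return (int)(p & 1); }
    static uint16_t ptr_mode(uint64_t p)  { return (uint16_t)(p >> TAG_SHIFT); }
    static uint64_t ptr_addr(uint64_t p)  { return (p & ADDR_MASK) & ~1ULL; }

    int main(void) {
        uint64_t p = tag_ptr(0x1234, 0xBEEF);
        printf("%d %x %llx\n", is_tagged(p), ptr_mode(p),
               (unsigned long long)ptr_addr(p));
        return 0;
    }
    ```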

    One other special feature is that the mode bits also encode a tag, which
    can be used to mark a pointer with the current process (with a value
    assigned by an RNG), with the LSB also being required to be set, if Rs1==X1.

    This can be used to add resistance against stack-stomping via buffer overflows, but is potentially risky with RISC-V:
    AUIPC X1, AddrHi
    JALR X0, AddrLo(X1)
    Can nuke the process, when officially it is allowed (vs forcing the use
    of a different register to encode a long branch).

    Where, for other contexts, AUIPC would necessarily need to produce an
    untagged address.


    It might be possible to do that effectively in a single-core,
    single-thread, single-task system that isn't taking many (if any)
    interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
    the interrupts, context switches or other applications at compile time,
    so the compiler has no idea what is in cache and what isn't. I don't understand why HP and Intel didn't realise this. It took me years, but I
    am no CPU designer.


    No idea there, but either way, seems like a difficult problem.


    Speculative execution addresses that problem quite effectively. We don't
    have a better way, almost thirty years after Itanium design decisions
    were taken. They didn't want to do speculative execution, and they chose
    an instruction format and register set that made adding it later hard. If
    it was ever tried, nothing was released that had it AFAIK.

    The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
    enough ILP to keep those pipelines fed, or they'd just eat cache capacity
    and memory bandwidth executing no-ops ... in a very bulky instruction set. They didn't have a general way to extract enough ILP. Nobody does, even
    now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.


    Yeah...

    In my case, there is only 1 pipeline per core for now.
    But ISA is still mostly RISC-like.


    Not so much the 128-bits with 3-instructions thing, and then needing to
    NOP pad if one can't find 3 useful instructions which fit into the pipeline.

    My compiler would probably also be pretty awful if trying to target IA64.


    Though did get around to re-adding a repurposed version of the WEXifier
    for XG3 and RV, though its purpose was a little different in that these
    ISA's have no way to flag for parallel execution, so the purpose is more
    to shuffle instructions around to try to reduce register-RAW
    dependencies and to help out the in-order superscalar stuff.


    There was also an architectural misfeature with floating-point advance
    loads that could make them disappear entirely if there was a call
    instruction between an advance-load instruction and the corresponding check-load instruction. That cost me a couple of weeks working out and reporting the bug, which was unfixable. The only work-around was to
    re-issue all outstanding floating-point advance-load instructions
    after each call returned. The effective code density went down further,
    and there were lots of extra read instructions issued.

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

    ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998. RISC-V has
    not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.


    There seem to be some questionable design choices here, and also a lot
    of foot dragging for things that could help.


    They also seem to be relatively focused on the assumption of CPUs having low-latency ALU and Memory-Load ops, which seems like a dangerous
    assumption to make.


    Like, how about one not try to bake in assumptions about 1-cycle ALU and 2-cycle Load being practical?...

    Vs, say, 2-cycle ALU ops and 3-cycle Loads; with an ideal of putting 5 instructions between an instruction that generates a result and the instruction that consumes the result as this is more likely to work with in-order superscalar.


    But, then one runs into the issue that if a basic operation then
    requires a multi-op sequence, the implied latency goes up considerably
    (say, could call this "soft latency", or SL).

    So, for example, it means that, say:
    2-instruction sign extension:
    RV working assumption: 2 cycles
    Hard latency (2c ALU): 4 cycles
    Soft latency: 12 cycles.
    For a 3-op sequence, the effective soft-latency goes up to 18, ...
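    The 2-instruction sign extension in question is the classic RV64 idiom
    (absent a dedicated SEXT.H), written here in C purely as illustration:

    ```c
    /* SLLI by 48 then SRAI by 48: the arithmetic right shift
     * replicates bit 15 into the upper bits. */
    #include <stdint.h>
    #include <stdio.h>

    static int64_t sext16(uint64_t x) {
        return (int64_t)(x << 48) >> 48;   /* SLLI x,48 ; SRAI x,48 */
    }

    int main(void) {
        printf("%lld %lld\n", (long long)sext16(0xFFFF),
                              (long long)sext16(0x7FFF));
        return 0;
    }
    ```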

    And, in cases where the soft-latency significantly exceeds the total
    length of the loop body, it is no longer viable to schedule the loop efficiently.

    So, in this case, an indexed-load instruction has an effective 9c SL,
    whereas SLLI+ADD+LD has a 21 cycle SL.
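    The two forms being compared can be spelled out in C. An ISA with
    scaled-indexed loads folds base + (i << 3) into the load itself; an
    ISA without it spends SLLI + ADD + LD. Both functions compute the
    same value; the per-op latency figures are this poster's assumptions:

    ```c
    /* Indexed load vs. the split SLLI+ADD+LD sequence. */
    #include <stdint.h>
    #include <stdio.h>

    static int64_t load_indexed(const int64_t *base, int64_t i) {
        return base[i];                    /* one LD [Rb + Ri<<3] */
    }

    static int64_t load_split(const int64_t *base, int64_t i) {
        intptr_t off = i << 3;                       /* SLLI */
        const char *p = (const char *)base + off;    /* ADD  */
        return *(const int64_t *)p;                  /* LD   */
    }

    int main(void) {
        int64_t a[4] = { 10, 20, 30, 40 };
        printf("%lld %lld\n", (long long)load_indexed(a, 2),
                              (long long)load_split(a, 2));
        return 0;
    }
    ```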


    where, in this case, the goal of something like the WEXifier is to
    minimize this soft-latency cost (in cases where a dependency is seen,
    any remaining soft-latency is counted as penalty).

    But, then again, maybe the concept of this sort of "soft latency" seems
    a bit alien.


    Granted, not sure how this maps over to OoO, but had noted that even
    with modern CPUs, there still seems to be benefit from assuming a sort
    of implicit high latency for instructions over assuming a lower latency.



    Well, or something like PowerPC, but then again, IBM still had
    difficulty keeping PPC competitive, so dunno. Then again, I think
    IBM's PPC issues were more related to trying to keep up in the chip
    fab race that was still going strong at the time, rather than an
    ISA design issue.

    I think that was fabs, rather than architecture. While I was providing libraries for PowerPC (strictly, POWER4, POWER5 and POWER6, one after another) it always had rather decent performance for its clockspeed and process.


    OK.

    I guess it is a question here of what if IBM had outsourced their fab
    stuff earlier.

    Though, there is still the potential downside of licensing-based
    production (say, if they went for something more like the ARM model),
    which is possibly worse than the argued threat of vendor-based market fragmentation (the usual counter-argument against RISC-V, *1).

    *1: Where people argue that if each vendor can do a CPU with their own
    custom ISA variants and without needing to license or get approval from
    a central authority, that invariably everything would decay into an
    incoherent mess where there is no binary compatibility between
    processors from different vendors (usual implication being that people
    are then better off staying within the ARM ecosystem to avoid RV's lawlessness).


    John


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 20 23:49:54 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 2/19/2026 5:10 PM, John Dallman wrote: ------------------------------------
    This can be used to add resistance against stack-stomping via buffer overflows, but is potentially risky with RISC-V:
    AUIPC X1, AddrHi
    JALR X0, AddrLo(X1)
    Can nuke the process, when officially it is allowed (vs forcing the use
    of a different register to encode a long branch).

    That should be:
    AUIPC x1,hi(offset)
    JALR x0,lo(offset)

    using:
    SETHI x1,AddrHi
    JALR x0,AddrLo

    would work.

    ---------------------
    Like, how about one not try to bake in assumptions about 1-cycle ALU and 2-cycle Load being practical?...

    for the above to work::
    ALU is < ½ cycle leaving ¼ cycle output drive and ¼ cycle input mux
    SRAM is ½ cycle, AGEN to SRAM decode is ¼ cycle, SRAM output to shifter
    is < ¼ cycle, and set-selection is ½ cycle; leaving ¼ cycle for output drive.

    Vs, say, 2-cycle ALU ops and 3-cycle Loads; with an ideal of putting 5 instructions between an instruction that generates a result and the instruction that consumes the result as this is more likely to work with in-order superscalar.

    1-cycle ALU with 3 cycle LD is not very hard at 16-gates per cycle.
    2-cycle LD is absolutely impossible with 1-cycle addr-in to data-out
    SRAM. So, we generally consider any design with 2-cycle LD to be
    frequency limited.

    But, then one runs into the issue that if a basic operation then
    requires a multi-op sequence, the implied latency goes up considerably
    (say, could call this "soft latency", or SL).

    So, for example, it means that, say:
    2-instruction sign extension:
    RV working assumption: 2 cycles
    Hard latency (2c ALU): 4 cycles
    Soft latency: 12 cycles.
    For a 3-op sequence, the effective soft-latency goes up to 18, ...

    One of the reasons a 16-gate design works better in practice than
    a 12-gate design. And why a 1-cycle ALU, 3-cycle LD runs at higher
    frequency.

    And, in cases where the soft-latency significantly exceeds the total
    length of the loop body, it is no longer viable to schedule the loop efficiently.

    In software, there remains no significant problem running the loop
    in HW.

    So, in this case, an indexed-load instruction has an effective 9c SL, whereas SLLI+ADD+LD has a 21 cycle SL.

    3-cycle indexed LD with cache hit in many µArchitectures--with scaled
    indexing. This is one of the driving influences of "raising" the
    semantic content of LD/ST instructions to [Rbase+Rindex<<sc+Disp]

    where, in this case, the goal of something like the WEXifier is to
    minimize this soft-latency cost (in cases where a dependency is seen,
    any remaining soft-latency is counted as penalty).

    But, then again, maybe the concept of this sort of "soft latency" seems
    a bit alien.

    Those ISAs without scaled indexing have longer effective latency through
    cache than those with; those without full-range Disp have similar problems; those without both are effectively adding 3-4 cycles to LD latency.

    Which is why the size of the execution windows grew from 60-ish to 300-ish
    to double performance--the ISA is adding latency and the size of execution window is the easiest way to absorb such latency.
    {{60-ish ~= Athlon; 300-ish ~= M4}}

    Granted, not sure how this maps over to OoO, but had noted that even
    with modern CPUs, there still seems to be benefit from assuming a sort
    of implicit high latency for instructions over assuming a lower latency.

    Execution window size is how it maps.

    *1: Where people argue that if each vendor can do a CPU with their own custom ISA variants and without needing to license or get approval from
    a central authority, that invariably everything would decay into an incoherent mess where there is no binary compatibility between
    processors from different vendors (usual implication being that people
    are then better off staying within the ARM ecosystem to avoid RV's lawlessness).

    RISC-V seems to be "eating" a year (or a bit more) to bring this mess into
    a coherent framework.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sat Feb 21 01:00:05 2026
    From Newsgroup: comp.arch

    On 2/20/2026 5:49 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/19/2026 5:10 PM, John Dallman wrote:
    ------------------------------------
    This can be used to add resistance against stack-stomping via buffer
    overflows, but is potentially risky with RISC-V:
    AUIPC X1, AddrHi
    JALR X0, AddrLo(X1)
    Can nuke the process, when officially it is allowed (vs forcing the use
    of a different register to encode a long branch).

    That should be:
    AUIPC x1,hi(offset)
    JALR x0,lo(offset)

    using:
    SETHI x1,AddrHi
    JALR x0,AddrLo

    would work.


    Usual notation seems to be that AUIPC uses a direct-immediate notation
    (eg "AUIPC X1, 0x12345"), and "JALR X0, 0x678(X1)".

    Though, GAS and friends can use:
    AUIPC X1, %hi(symbol)
    JALR X0, X1, %lo(symbol)

    Well, and for JALR:
    JALR Xd, Disp(Xs)
    JALR Xd, Xs, Disp
    Being basically equivalent.
    ...

    If expressing it using a symbol rather than a literal displacement...

    But, either way, it is using X1 that was the relevant point here, which
    is technically allowed in RISC-V, but would explode if one tries to
    constrain X1 to being used as a link register and then also uses
    enforced tag checking in this case.

    In BGBCC's native ASM notation, the symbol case would typically be
    expressed as:
    BRA symbol
    Which then implies the 2-op form if the "symbol is within +/- 1MB" check fails. But, would differ in that this pseudo-op will stomp X5.



    But, yeah, RISC-V ASM notation conventions seem to get a little
    confusing sometimes...

    But, errm, my point wasn't so much about RISC-V's ASM syntax patterns.


    There is a non-zero risk though when one disallows uses that are
    theoretically allowed in the ISA, even if GCC doesn't use them.


    Though, the reason to sanity-check X1 is that it is pretty much
    universally used as the link register, and sanitizing the link-register
    can be used to trap on potential stack-corruption in buffer overflow
    exploits (more so with a compiler that tends not to use stack canary
    checks).


    Well, and in terms of typical ASM notation, there is this mess:
    (Rb) / @Rb / @(Rb) //load/store register
    (Rb, Disp) / Disp(Rb) //load/store disp
    @(Rb, Disp) / @(Disp, Rb) //load/store disp (but with @)
    Then:
    (Rb, Ri) //indexed (element sized index)
    Ri(Rb) //indexed (byte-scaled index)
    (Rb, Ri, Sc) //indexed with scale
    Disp(Rb, Ri) //indexed with displacement
    Disp(Rb, Ri, Sc) //indexed with displacement and scale
    Then:
    @Rb+ / (Rb)+ //post-increment
    @-Rb / -(Rb) //pre-decrement
    @Rb- / (Rb)- //post-decrement
    @+Rb / +(Rb) //pre-increment

    And, in some variants, all the registers prefixed with '%'.

    Comparably, the Intel style notation is more consistent, but don't
    necessarily want to also throw Intel notation into this particular mix.


    Well, more so as there is an implicit visual hint, say in x86:
    movl 128(%ebx), %eax
    mov eax, [ebx+128]
    Where the notation partly also keys one into the register ordering, but
    if on had Intel style memory notation while using AT&T style ordering,
    this would be a problem (confusing mess).


    Well, or the other messy feature that BGBCC tries to infer the register
    order based on which mnemonics are used:
    OP Rd, Rs1, Rs2 //used if RV mnemonics dominate
    OP Rs1, Rs2, Rd //otherwise

    Likely, if [] notation were supported, it would signal "dest, source" ordering (like Intel x86, and ARM), though in this case
    [Rb+Disp] and [Rb,Disp] likely being treated as analogous.


    But, alas, kind of a mess...

    And, if trying to mix/match styles, "there be dragons here"...



    ---------------------
    Like, how about one not try to bake in assumptions about 1-cycle ALU and
    2-cycle Load being practical?...

    for the above to work::
    ALU is < ½ cycle leaving ¼ cycle output drive and ¼ cycle input mux
    SRAM is ½ cycle, AGEN to SRAM decode is ¼ cycle, SRAM output to shifter
    is < ¼ cycle, and set-selection is ½ cycle; leaving ¼ cycle for output drive.

    Vs, say, 2-cycle ALU ops and 3-cycle Loads; with an ideal of putting 5
    instructions between an instruction that generates a result and the
    instruction that consumes the result as this is more likely to work with
    in-order superscalar.

    1-cycle ALU with 3 cycle LD is not very hard at 16-gates per cycle.
    2-cycle LD is absolutely impossible with 1-cycle addr-in to data-out
    SRAM. So, we generally consider any design with 2-cycle LD to be
    frequency limited.


    My stuff mostly assumes:
    ADD and similar: 2 cycles
    Load: 3 cycles.

    In this case, some 1 cycle ops exist:
    MOV Rs, Rd / MV Xd, Xs
    MOV Imm, Rd / LI Xd, Imm

    For the RV and XG3 decoders, some special instructions are decoded as
    one of the above:
    ADDI Xd, Xs, 0 => MV
    ADDI Xd, X0, Imm => LI

    But, most remain as 2/3 cycle.

    A few instructions had a 4 cycle latency, mostly those which combined a
    Load with a format-conversion or similar.


    But, then one runs into the issue that if a basic operation then
    requires a multi-op sequence, the implied latency goes up considerably
    (say, could call this "soft latency", or SL).

    So, for example, it means that, say:
    2-instruction sign extension:
    RV working assumption: 2 cycles
    Hard latency (2c ALU): 4 cycles
    Soft latency: 12 cycles.
    For a 3-op sequence, the effective soft-latency goes up to 18, ...

    One of the reasons a 16-gate design works better in practice than
    a 12-gate design. And why a 1-cycle ALU, 3-cycle LD runs at higher
    frequency.


    OK.

    I ended up going for a slightly lower clock speed and slightly more
    complex operations because often this resulted in better performance.

    And, while I could probably run an RV32IM core at 100 MHz, I would need
    to pay in other areas.


    And, in cases where the soft-latency significantly exceeds the total
    length of the loop body, it is no longer viable to schedule the loop
    efficiently.

    In software, there remains no significant problem running the loop
    in HW.


    Another traditional option is modulo scheduling, but actually doing so
    in the compiler is more complex (and BGBCC does not do so).

    Can often do a "poor man's" version in C, which while a less elegant
    solution, often works out better in practice.


    One can try to effectively unroll the loop enough that the latency can
    be covered efficiently, but then the issue may become one of running out
    of working registers, and one doesn't want to unroll so much that the
    code starts thrashing, which tends to hurt worse than the potential loss
    of ILP by being narrower.
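    The "poor man's" version in C mentioned above might look like this:
    unrolling by four with independent accumulators so an in-order
    superscalar has work to overlap with the loads. Illustrative only,
    not BGBCC output:

    ```c
    /* Manual unroll-by-4 with four independent dependency chains. */
    #include <stdio.h>

    static long sum4(const long *a, int n) {   /* assumes n a multiple of 4 */
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];        /* the four adds have no dependence */
            s1 += a[i + 1];    /* on each other, so their load/add  */
            s2 += a[i + 2];    /* latencies can overlap             */
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        long a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("%ld\n", sum4(a, 8));
        return 0;
    }
    ```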



    So, in this case, an indexed-load instruction has an effective 9c SL,
    whereas SLLI+ADD+LD has a 21 cycle SL.

    3-cycle indexed LD with cache hit in many µArchitectures--with scaled indexing. This is one of the driving influences of "raising" the
    semantic content of LD/ST instructions to [Rbase+Rindex<<sc+Disp]


    Yeah, pretty much, or at least [Rb+Ri<<Sc], with the full [Rb+Ri<<Sc+Disp] scenario often being uncommon IME.

    Well, with a possible exception of [GP+Ri<<Sc+Disp] which would see a localized spike due to:
    someGlobalArray[index]

    As-is, this case tends to manifest in my case as, say:
    LEA.Q (GP, Disp), R5
    MOV.Q (R5, R10), R11


    where, in this case, the goal of something like the WEXifier is to
    minimize this soft-latency cost (in cases where a dependency is seen,
    any remaining soft-latency is counted as penalty).

    But, then again, maybe the concept of this sort of "soft latency" seems
    a bit alien.

    Those ISAs without scaled indexing have longer effective latency through cache than those with; those without full-range Disp have similar problems; those without both are effectively adding 3-4 cycles to LD latency.

    Which is why the size of the execution windows grew from 60-ish to 300-ish
    to double performance--the ISA is adding latency and the size of execution window is the easiest way to absorb such latency.
    {{60-ish ~= Athlon; 300-ish ~= M4}}


    OK.

    FWIW, there are reasons I have indexed addressing and jumbo-prefixes for larger immediate values and displacements.

    But, seemingly, the idea of deviating from 2R1W and 16/32-bit instruction encodings fills the RISC-V people with fear.



    Granted, not sure how this maps over to OoO, but had noted that even
    with modern CPUs, there still seems to be benefit from assuming a sort
    of implicit high latency for instructions over assuming a lower latency.

    Execution window size is how it maps.


    OK.


    *1: Where people argue that if each vendor can do a CPU with their own
    custom ISA variants and without needing to license or get approval from
    a central authority, that invariably everything would decay into an
    incoherent mess where there is no binary compatibility between
    processors from different vendors (usual implication being that people
    are then better off staying within the ARM ecosystem to avoid RV's
    lawlessness).

    RISC-V seems to be "eating" a year (or a bit more) to bring this mess into
    a coherent framework.

    Yeah, and while ARC drags their feet,
    Qualcomm/Huawei/ByteDance/T-Head/... each go off and do similar things
    but in different ways...

    If I were organizing it, would likely handle it differently, by having a nested structure:
    Formal / Frozen //parts of the ISA that are fully settled.
    Semi-formal / non-frozen //details subject to change.
    provisional / experimental //very unstable.
    vendor-specific //excluded from standardization.


    In the provisional space, encodings could be defined, but could be
    reclaimed if the feature is "dead"; but would be in encoding blocks
    where they could be standardized later.

    The main difference being that in the provisional space, there would be a semi-official website listing registered encodings, rather than these encodings being scattered in the ISA documentation for the various
    vendor processors (requiring digging through a bunch of PDFs and so on to
    try to figure out which encodings are already in-use).


    Then sometimes there are encodings that are defined in ways that don't
    make sense. For example, there is apparently a RISC-V core from MIPS
    Technologies where they went and added Load/Store Pair, but with two
    data source/dest register fields and a very small displacement.

    This contrasts strongly with, say, having even-pair registers and a
    non-tiny displacement (the displacement needs to be at least big enough
    to cover a typical load/store area for a prolog/epilog).


    I had at first reused LDU/SDU encodings, but then the proposal for
    Load/Store indexed didn't go with my encoding scheme, but a different
    one that also used LDU/SDU (but, maybe this is ultimately a better place
    to put them; vs my approach of shoving them in an odd corner within the
    'A' extension's block).



    I ended up migrating Load/Store pair to the FLQ/FSQ encodings, partly as
    I had no intention to implement the Q extension as-is (and I needed
    somewhere to relocate them to). But, then this led to the "Pseudo Q" idea.


    In my case, there is XG3, but I consider this in a different category
    as, while it retains compatibility with RV64G and partial with RV64GC
    (mostly by providing for interoperability); it is in some ways a notable departure from "pure" RISC-V (well, in that modes, tagged pointers, and swapping out RV-C encodings for a different 32-bit encoding space, are
    not particularly small additions if one didn't already have a CPU
    designed in this way).


    ...





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Feb 21 16:18:11 2026
    From Newsgroup: comp.arch

    [email protected] (John Dallman) writes:
    [some unattributed source writes:]
    In a way, it showed that they screwed up the design pretty hard
    that x86-64 ended up being the faster and more efficient option...

    They did. They really did.

    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    IA-64 certainly had significantly larger code sizes than others, but I
    think that they expected it and found it acceptable. And a
    software-pipelined loop on IA-64 is probably smaller than an
    auto-vectorized loop on AMD64+AVX, thanks to their predication
    features that allowed them to avoid extra code for the ramp-up and
    ramp-down of the software-pipelined loops, whereas auto-vectorized
    code usually requires ramp-down, plus quite a bit of other overhead.
    Clang seems to be significantly better at keeping such loops small
    than gcc.

    They
    had two much bigger problems, though.

    They'd correctly understood that the low speed of affordable dynamic RAM
    as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast.

    They may have "understood" it, as have many others who have read
    "Hitting the memory wall", but I have disputed in 2001 that this is
    correct in general:
    https://www.complang.tuwien.ac.at/anton/memory-wall.html

    [Looking at that text again, I see the claims "On current
    high-performance machines this time [for compulsory cache misses] is
    limited to a few seconds if the process stays in physical RAM" and
    "Memory bandwidth tends to grow with memory size, so this won't change
    much"; as it happens, I recently measured the memory bandwidth of my
    PC with a Ryzen 8700G and 64GB RAM (whereas my machine in 2001 had
    192MB): it is 64GB/s when read sequentially, i.e., it can read all its
    RAM in 1s. For random accesses, performance is significantly worse,
    however.]

    McKee has written a retrospective of the paper in 2004 <http://svmoore.pbworks.com/w/file/fetch/59055930/p162-mckee.pdf>, but
    has not cited my reaction (but she apparently had a wealth of
    reactions to select from, so that's excusable).

    Anyway, my point is that there are many programs that are not
    memory-bound, and my convenience benchmarks I usually use (a LaTeX run
    and Gforth's small benchmarks) are among them. Itanium II sucks on
    them compared to its contemporary (or even older) competition; numbers
    are times in seconds (lower means faster):

    LaTeX <https://www.complang.tuwien.ac.at/franz/latex-bench>
    - HP workstation 900MHz Itanium II, Debian Linux 3.528
    - Athlon (Thunderbird) 1200C, VIA KT133A, PC133 SDRAM, RedHat7.1 1.68
    - Pentium 4 2.66GHz, 512KB L2, Debian 3.1 1.31
    - Athlon 64 3200+, 2000MHz, 1MB L2, Fedora Core 1 (64-bit) 0.76

    Gforth <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/Benchres>:
    sieve bubble matrix fib fft release; CPU; gcc
    0.708 0.780 0.484 1.028 0.552 20250201 Itanium II 900MHz (HP rx2600); gcc-4.3.2
    0.192 0.276 0.108 0.360 0.7.0; K8 2GHz (Opteron 270); gcc-4.3.1

    Their solution was to have the compiler schedule loads
    well in advance. They assumed, without evidence, that a compiler with
    plenty of time to think could schedule loads better than hardware doing
    it dynamically. It's an appealing idea, but it's wrong.

    Actually, other architectures also added prefetching instructions for
    dealing with that problem. All I have read about that was that there
    were many disappointments when using these instructions. I don't know
    if there were any successes, and how frequent they were compared to
    the disappointments. So I don't see that IA-64 was any different from
    other architectures in that respect.

    My explanations for the disappointments are:

    * Hardware prefetchers (typically stride-based) already cover a lot of
    the cases where prefetch instructions would otherwise help. In that
    case memory-bound programs are limited by bandwidth, not by latency.

    * When dealing with pointer-chasing in linked lists and such, unless
    the linked list entries usually happen to be a fixed stride apart
    (where the stride predictor helps), the pointer-chasing limits the
    performance in any case; for an in-order uarch, prefetching or
    scheduling the fetch of the next pointer ahead may save a few cycles,
    but typically few compared to the memory access latency; for OoO
    uarchs, OoO usually will result in a bunch of instructions (among
    them the load of the next pointer) waiting for the result of the RAM
    access.
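    The pointer-chasing limit is easy to see in code. A sketch using the GCC/Clang `__builtin_prefetch` builtin (the list layout and names are mine, for illustration only):

    ```c
    #include <stddef.h>

    struct node { struct node *next; long payload; };

    /* Walk a list, prefetching one node ahead.  Note the catch described
     * above: the prefetch address p->next->next is itself the product of
     * a pointer chase, so the prefetch runs at most one miss ahead. */
    long sum_list(const struct node *p)
    {
        long s = 0;
        while (p != NULL) {
    #if defined(__GNUC__)
            if (p->next != NULL)
                __builtin_prefetch(p->next->next);  /* hint only; NULL is safe */
    #endif
            s += p->payload;
            p = p->next;
        }
        return s;
    }
    ```

    The prefetch distance is structurally capped at one node because each address comes from the previous load, which is why such prefetching tends to disappoint on linked structures.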

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It might be possible to do that effectively in a single-core,
    single-thread, single-task system that isn't taking many (if any)
    interrupts. In a multi-core system, running a complex operating
    system, several multi-threaded applications, and taking frequent
    interrupts and context switches, it is _not possible_. There is no
    knowledge of any of the interrupts, context switches or other
    applications at compile time, so the compiler has no idea what is in
    cache and what isn't.

    I don't think that's a big issue. Sure, a context switch in one core
    results in the need to refill most of the L1 and L2 caches, and maybe
    even a part of the L3 cache, but that's the cost of the context
    switches, and the prefetching (whether by the compiler or the
    programmer) is usually not designed to mitigate that. That's because
    context switches are rare; e.g., when I run an 8-thread DGEMM (using
    libopenmp) for 16000x16000 matrices on my 8-core desktop system (with
    other stuff working (but consuming little CPU) at the same time), I see:

    # perf stat -e sched:sched_switch,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions openblas 16000

    Performance counter stats for 'openblas 16000':

    14811 sched:sched_switch # 82.470 /sec
    179592.12 msec task-clock # 7.444 CPUs utilized
    14811 context-switches # 82.470 /sec
    5 cpu-migrations # 0.028 /sec
    9789 page-faults # 54.507 /sec
    548255317659 cycles # 3.053 GHz
    936546607037 instructions # 1.71 insn per cycle

    24.125989128 seconds time elapsed

    173.502827000 seconds user
    6.016053000 seconds sys

    I.e., one context switch every 37M cycles (or 63M instructions).
    Refilling the 1MB L2 (Zen4) from RAM at 64GB/s costs about 47k cycles,
    i.e., less than 0.2% of the time between context switches.
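    Those figures can be rechecked directly from the counters quoted above (a throwaway sketch; the 1MB, 64GB/s and 3.053GHz figures come from this post, nothing else is assumed):

    ```c
    /* Recompute: cycles per context switch, cycles to refill a 1MB L2
     * from RAM at 64GB/s on a 3.053GHz core, and the fraction lost. */
    static double cycles_per_switch(void) { return 548255317659.0 / 14811.0; } /* ~37M  */
    static double l2_refill_cycles(void)  { return (1e6 / 64e9) * 3.053e9; }   /* ~48k  */
    static double refill_fraction(void)   { return l2_refill_cycles() / cycles_per_switch(); }
    ```

    The refill fraction comes out around 0.13%, consistent with the "less than 0.2%" claim.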

    [You may wonder why the processor is only running at 3GHz; I have
    power-limited it to 25W; it tends to do the same task close to 5GHz in
    its default setting (88W power limit).]

    Speculative execution addresses that problem quite effectively.

    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.
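    The binary-search observation is worth a sketch. In the branching version each (often mispredicted) compare lets the core speculatively issue the next array load; in the branchless version each load address depends on the previous load's result, so nothing overlaps. A minimal lower_bound-style pair (my code, not taken from the results mentioned above):

    ```c
    #include <stddef.h>

    /* Branching lower_bound: the compare is a conditional branch, so an
     * OoO core can speculate past it and start the next load early. */
    size_t lb_branchy(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1;
            else              hi = mid;
        }
        return lo;
    }

    /* Branchless lower_bound: the ternary typically compiles to a
     * conditional move, so each load address depends on the previous
     * load's result -- no speculation, one miss latency per level. */
    size_t lb_branchless(const int *a, size_t n, int key)
    {
        const int *base = a;
        if (n == 0) return 0;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }
    ```

    Both return the index of the first element not less than the key; whether the compiler actually emits a branch or a cmov is its choice, which is part of why such measurements vary.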

    OoO helps in several ways: it will do some work in the shadow of the
    load (although the utilization will still be abysmal even with
    present-day schedulers and ROBs [1]); but more importantly, it can
    dispatch additional loads that may also miss the cache, resulting in
    more memory-level parallelism. To some extent that also happens with
    in-order uarchs where the in-order aspect only takes effect when using
    the result of an instruction, but OoO provides more cases (especially
    if you do not know which loads will miss, which often is the case).

    [1] With 50ns DRAM latency (somewhat optimistic) and 5GHz CPU with 6
    rename slots per cycle, 1500 instructions could be executed (at its
    most extreme) while a load is served from DRAM, but the scheduling
    (and non-scheduler) slots tend to be ~100, and the ROB entries tend to
    be ~500.
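    The footnote's arithmetic, spelled out (a trivial check under exactly the assumptions stated in [1]):

    ```c
    /* 50ns DRAM latency at 5GHz is 250 cycles; at 6 rename slots per
     * cycle that is up to 1500 instructions in the shadow of one miss. */
    int miss_shadow_insns(double latency_ns, double ghz, int rename_per_cycle)
    {
        return (int)(latency_ns * ghz) * rename_per_cycle;
    }
    ```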

    But I don't see that multi-core, multi-threaded, multi-tasking makes a significant difference here.

    We don't
    have a better way, almost thirty years after Itanium design decisions
    were taken. They didn't want to do speculative execution

    They wanted to do it (and did it) in the compiler; the corresponding architectural feature is IIRC the advanced load.

    and they chose
    an instruction format and register set that made adding it later hard.

    The instruction format makes no difference. Having so many registers
    may have made it harder than otherwise, but SPARC also used many
    registers, and we have seen OoO implementations (and discussed them a
    few months ago). The issue is that speculative execution and OoO
    makes all the EPIC features of IA-64 unnecessary, so if they cannot do
    a fast in-order implementation of IA-64 (and they could not), they
    should just give up and switch to an architecture without these
    features, such as AMD64. And Intel did, after a few years of denying.
    The remainder of IA-64 was just to keep it alive for the customers who
    had bought into it.

    The other problem was that they had three (or six, or twelve) in-order
    pipelines running in parallel. That meant the compilers had to provide
    enough ILP to keep those pipelines fed, or they'd just eat cache
    capacity and memory bandwidth executing no-ops ...

    No, IA-64 has groups (sets of instructions that are allowed to start
    within the same cycle) down to one instruction. Many people think
    that the bundles (3 instructions encoded in 128 bits) are the same as
    groups, but that's not the case. There can be stops (boundaries
    between groups) within a bundle. The only thing the bundle format
    limits is that branch targets may only start at a 128-bit boundary.

    The IA-64 instruction format is still not particularly compact, but
    it's not as bad as you indicate.

    They didn't have a general way to extract enough ILP.

    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

    ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998.

    OoO processors with speculative execution have been widely available
    since 1995 (Pentium Pro, PA 8000; actually PPC 603 preceded that in
    1994, but made less of a splash than the Pentium Pro, and the ES/9000
    H2 was available in 1991, but was confined to the IBM world; did they
    ever submit SPEC results for that?). And the experience is that EPIC architectural features (beyond SIMD) are unnecessary, and the result
    of that is indeed ARM64, but also AMD64 and RISC-V.

    RISC-V has
    not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.

    In a world where we see convergence on fewer and fewer architecture
    styles and on fewer and fewer architectures, you only see the
    investment necessary for high-performance implementations of a new
    architecture if there is a very good reason not to use one of the
    established architectures (for ARM T32 and ARM A64 the smartphone
    market was that reason). It may be that politics will provide that
    reason for another architecture, but even then it's hard. But RISC-V
    seems to have the most mindshare among the alternatives, so if any
    architecture will catch up, it looks like the best bet.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sat Feb 21 18:41:12 2026
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    That did not solve memory latency, but that's a problem even for OoO
    cores.

    I suspect a big part of the problem was tension between Intel and HP
    where the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    The HP side had people like Bob Rau (Cydrome) and Josh Fisher
    (Multiflow), and given their premise, the architecture is ok; somewhat
    on the complex side, but they wanted to cover all the good ideas from
    earlier designs; after all, it was to be the one architecture to rule
    them all (especially performancewise). You cannot leave out a feature
    that a competitor could then add to outperform IA-64.

    The major problem was that the premise was wrong. They assumed that
    in-order would give them a clock rate edge, but that was not the case,
    right from the start (The 1GHz Itanium II (released July 2002)
    competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
    XP (released June 2002)). They also assumed that explicit parallelism
    would provide at least as much ILP as hardware scheduling of OoO CPUs,
    but that was not the case for general-purpose code, and in any case,
    they needed a lot of additional ILP to make up for their clock speed disadvantage.

    The odd thing is that these were hardware companies betting on "someone
    else" solving their problem, yet if compiler people truly had managed
    to solve those problems, then other hardware companies could have
    taken advantage just as well.

    I am sure they had patents on stuff like the advanced load and the
    ALAT, so no, other hardware companies would have had a hard time.

    To me the main question is whether they were truly confused and just
    got lucky (lucky because they still managed to sell their idea enough
    that most RISC companies folded),

    I think most RISC companies had troubles scaling. They were used to
    small design teams spinning out simple RISCs in a short time, and did
    not have the organization to deal with the much larger projects that
    OoO superscalars required. And while everybody inventing their own architecture may have looked like a good idea when developing an
    architecture and its implementations was cheap, it looked like a bad
    deal when development costs started to ramp up in the mid-90s. That's
    why HP went to Intel, and other companies (in particular, SGI) took
    this as an exit strategy from the own-RISC business.

    DEC had increasing delays in their chips, and eventually could not
    make enough money with them and had to sell themselves to Compaq (who
    also could not sustain the effort and sold themselves to HP (who
    canceled Alpha development)). I doubt that IA-64 played a big role in
    that game.

    Back to IA-64: At the time, when OoO was just starting, the premise of
    IA-64 looked plausible. Why wouldn't they see a fast clock rate and
    higher ILP from explicit parallelism than conventional architectures
    would see from OoO (apparently complex, and initially without anything
    like IA-64's ALAT)?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Feb 21 20:15:34 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 2/20/2026 5:49 PM, MitchAlsup wrote:
    ----------------------------

    There is a non-zero risk though when one disallows uses that are theoretically allowed in the ISA, even if GCC doesn't use them.

    This is why one must decode all 32-bits of each instruction--so that
    there is no hole in the decoder that would allow the core to do some-
    thing not directly specified in ISA. {And one of the things that make
    an industrial quality ISA so hard to fully specify.}}
    ---------------------

    Well, and in terms of typical ASM notation, there is this mess:
    (Rb) / @Rb / @(Rb) //load/store register
    (Rb, Disp) / Disp(Rb) //load/store disp
    @(Rb, Disp) / @(Disp, Rb) //load/store disp (but with @)
    Then:
    (Rb, Ri) //indexed (element sized index)
    Ri(Rb) //indexed (byte-scaled index)
    (Rb, Ri, Sc) //indexed with scale
    Disp(Rb, Ri) //indexed with displacement
    Disp(Rb, Ri, Sc) //indexed with displacement and scale
    Then:
    @Rb+ / (Rb)+ //post-increment
    @-Rb / -(Rb) //pre-decrement
    @Rb- / (Rb)- //post-decrement
    @+Rb / +(Rb) //pre-increment

    And, in some variants, all the registers prefixed with '%'.

    Leading to SERIAL DECODE--which is BAD.
    -----------------------
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Feb 21 20:28:00 2026
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    [email protected] (John Dallman) writes:
    [some unattributed source writes:]
    ----------------
    Actually, other architectures also added prefetching instructions for
    dealing with that problem. All I have read about that was that there
    were many disappointments when using these instructions. I don't know
    if there were any successes, and how frequent they were compared to
    the disappointments. So I don't see that IA-64 was any different from
    other architectures in that respect.

    If there had been any significant successes, you would have heard about them.

    -------------------------------
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    ----------------------------
    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.

    At the cost of opening the core up to Spectre-like attacks.
    ------------------------
    The instruction format makes no difference. Having so many registers
    may have made it harder than otherwise, but SPARC also used many
    registers, and we have seen OoO implementations (and discussed them a
    few months ago). The issue is that speculative execution and OoO
    makes all the EPIC features of IA-64 unnecessary, so if they cannot do
    a fast in-order implementation of IA-64 (and they could not), they
    should just give up and switch to an architecture without these
    features, such as AMD64. And Intel did, after a few years of denying.
    The remainder of IA-64 was just to keep it alive for the customers who
    had bought into it.

    IA-64 was to prevent AMD (and others) from making clones--that is all. Intel/HP
    would have had a patent wall 10 feet tall.

    It did not fail by being insufficiently protected, it failed because it
    did not perform.
    -----------------
    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    And that should have been the end of the story. ... Sorry Ivan.


    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Feb 21 20:38:51 2026
    From Newsgroup: comp.arch


    [email protected] (Anton Ertl) posted:

    Stefan Monnier <[email protected]> writes:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    That did not solve memory latency, but that's a problem even for OoO
    cores.

    I suspect a big part of the problem was tension between Intel and HP
    were the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    The HP side had people like Bob Rau (Cydrome) and Josh Fisher
    (Multiflow), and given their premise, the architecture is ok; somewhat
    on the complex side, but they wanted to cover all the good ideas from
    earlier designs; after all, it was to be the one architecture to rule
    them all (especially performancewise). You cannot leave out a feature
    that a competitor could then add to outperform IA-64.

    In this time period, performance was doubling every 14 months, so if a
    feature added x performance it MUST avoid adding more than x/14 months
    to the schedule. If IA-64 had been 2 years earlier, it would have been
    competitive--sadly it was not.
    ---------------------
    To me the main question is whether they were truly confused and just
    got lucky (lucky because they still managed to sell their idea enough
    that most RISC companies folded),

    I think most RISC companies had troubles scaling. They were used to
    small design teams spinning out simple RISCs in a short time, and did
    not have the organization to deal with the much larger projects that
    OoO superscalars required.

    Most RISC teams did not have the cubic dollars of revenue to afford the
    team size needed for GBOoO design--nor, BTW, the management expertise
    to run such a large organization efficiently.

    And while everybody inventing their own architecture may have looked like a good idea when developing an
    architecture and its implementations was cheap,

    1-wide, and a bit of 2-wide.

    it looked like a bad
    deal when development costs started to ramp up in the mid-90s. That's
    why HP went to Intel, and other companies (in particular, SGI) took
    this as an exit strategy from the own-RISC business.

    DEC had increasing delays in their chips, and eventually could not
    make enough money with them and had to sell themselves to Compaq (who
    also could not sustain the effort and sold themselves to HP (who
    canceled Alpha development)). I doubt that IA-64 played a big role in
    that game.

    Back to IA-64: At the time, when OoO was just starting, the premise of
    IA-64 looked plausible. Why wouldn't they see a fast clock rate and
    higher ILP from explicit parallelism than conventional architectures
    would see from OoO (apparently complex, and initially without anything
    like IA-64's ALAT)?

    - anton
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sat Feb 21 14:59:54 2026
    From Newsgroup: comp.arch

    On 2/21/2026 2:15 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/20/2026 5:49 PM, MitchAlsup wrote:
    ----------------------------

    There is a non-zero risk though when one disallows uses that are
    theoretically allowed in the ISA, even if GCC doesn't use them.

    This is why one must decode all 32-bits of each instruction--so that
    there is no hole in the decoder that would allow the core to do some-
    thing not directly specified in ISA. {And one of the things that make
    an industrial quality ISA so hard to fully specify.}}
    ---------------------

    Sometimes there is a tension:
    What is theoretically allowed in the ISA;
    What is the theoretically expected behavior in some abstract model;
    What stuff is actually used by compilers;
    What features or behaviors does one want;
    ...

    Implementing RISC-V strictly as per an abstract model would both limit efficiency and hinder some use-cases.

    Then it comes down to "what do compilers do" and "what unintended
    behaviors could an ASM programmer stumble onto".

    Stuff like "Program misbehaves or crashes on a fairly mundane piece of
    code" are preferably avoided.


    Alternatives being, say:
    Define behaviors what programs are allowed to rely on;
    Be slightly conservative with how one defines edge cases;
    Avoid over-defining things too far outside the scope of what is actually relevant.

    Sometimes design elegance can become a trap.


    But, OTOH, having special cases for some instructions based on which
    registers or immediate values are used isn't exactly clean or elegant.

    Like, yeah:
    Using X0 or X1 here invokes magic;
    Instruction doesn't work unless X0 or X1;
    ...



    Well, and in terms of typical ASM notation, there is this mess:
    (Rb) / @Rb / @(Rb) //load/store register
    (Rb, Disp) / Disp(Rb) //load/store disp
    @(Rb, Disp) / @(Disp, Rb) //load/store disp (but with @)
    Then:
    (Rb, Ri) //indexed (element sized index)
    Ri(Rb) //indexed (byte-scaled index)
    (Rb, Ri, Sc) //indexed with scale
    Disp(Rb, Ri) //indexed with displacement
    Disp(Rb, Ri, Sc) //indexed with displacement and scale
    Then:
    @Rb+ / (Rb)+ //post-increment
    @-Rb / -(Rb) //pre-decrement
    @Rb- / (Rb)- //post-decrement
    @+Rb / +(Rb) //pre-increment

    And, in some variants, all the registers prefixed with '%'.

    Leading to SERIAL DECODE--which is BAD.
    -----------------------

    Depends on what ISA the ASM syntax is actually attached to...
    If it is a VAX, yeah, true enough.

    Seemingly most of these syntax variants go back to PDP-11 and VAX origins.

    Then some quirks, like '%' on register names, apparently mostly came
    from the M68K branch:
    PDP/VAX: No '%'
    M68K: Added '%'
    GAS on x86: Mostly kept using M68K notation.


    Then apparently the '@' notation originated either with Hitachi or
    Texas Instruments (along with putting '.' in many of the instruction
    mnemonics).

    So, if working backwards, could drop all the '@' variants, along with
    '%', ...


    Where, apparently, the syntax scheme I had mostly ended up using for
    BGBCC and my own stuff, ended up partly mutating back towards the
    original PDP/VAX style syntax.

    Namely, was using:
    (Rb)
    (Rb, Rb)
    (Rb, Disp) | Disp(Rb)
    ...

    Though, had mostly kept the dotted names vs reverting to dot-free names.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Feb 21 22:56:47 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 2/21/2026 2:15 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/20/2026 5:49 PM, MitchAlsup wrote:
    ----------------------------

    There is a non-zero risk though when one disallows uses that are
    theoretically allowed in the ISA, even if GCC doesn't use them.

    This is why one must decode all 32-bits of each instruction--so that
    there is no hole in the decoder that would allow the core to do some-
    thing not directly specified in ISA. {And one of the things that make
    an industrial quality ISA so hard to fully specify.}}
    ---------------------

    Sometimes there is a tension:
    What is theoretically allowed in the ISA;
    What is the theoretically expected behavior in some abstract model;
    What stuff is actually used by compilers;
    What features or behaviors does one want;
    ...
    Whether your ISA can be attacked with Spectre and/or Meltdown;
    Whether your DRAM can be attacked with RowHammer;
    Whether your call/return interface can be attacked with:
    { Return-Oriented Programming, Buffer Overflows, ...}

    That is; whether you care if your system provides a decently robust
    programming environment.

    I happen to care. Apparently, most do not.

    Implementing RISC-V strictly as per an abstract model would both limit efficiency and hinder some use-cases.

    One can make an argument that it is GOOD to limit attack vectors, and
    provide a system that is robust in the face of attacks.

    Then it comes down to "what do compilers do" and "what unintentional behaviors could an ASM programmer stumble onto".

    Naïve at best.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Sun Feb 22 03:00:59 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <[email protected]>:
    Stefan Monnier <[email protected]> writes:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    I knew the Multiflow people pretty well when I was at Yale. Trace
    scheduling was inspired by the FPS AP-120B, which had wide
    instructions issuing multiple operations and was extremely hard to
    program efficiently.

    Multiflow's compiler worked pretty well and did a good job of static
    scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
    used it in their VLIW projects. It was a good match for the hardware
    in the 1980s.

    But as computer hardware got faster and denser, it became possible to
    do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Feb 22 09:16:00 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    But as computer hardware got faster and denser, it became possible to
    do the scheduling on the fly in hardware, so you could get comparable
    performance with conventional instruction sets in a microprocessor.

    Actually, OoO microprocessors appeared before IA-64 implementations
    were originally planned to be released, and were implemented in larger processes, i.e., they consumed fewer hardware resources.

    The Pentium Pro was implemented with 5.5M transistors in a 0.35um
    process (with 8+8KB L1 cache) and a die area of 306mm^2 (probably
    including the separate L2 cache chip). Later Intel released the
    Klamath Pentium II also in 0.35um, but with 16+16KB L1, with 7.5M
    transistors and a die size of 203mm^2 (the die should be larger than
    the CPU die of the Pentium Pro, that's why I think that the Pentium
    Pro number includes the L2 cache die); die size numbers from https://pc.watch.impress.co.jp/docs/2008/1027/kaigai_5.pdf

    The PA-8000 is a 4-wide OoO CPU implemented with 3.8M transistors in a
    0.5um process in 337.69mm^2. It has all caches off-chip.

    The Merced Itanium and McKinley Itanium II were 6-wide and implemented
    in 180nm, the same feature size as the Willamette Pentium 4 and
    Thunderbird Athlon. The Merced is reported as having 25.4M
    transistors (with 16+16KB L1 and 96KB of L2 cache plus 295M
    transistors for 4MB external L3 cache). The McKinley is reported as
    having a die size of 421mm^2 and a transistor count of 221M (with
    16+16KB L1, 256KB L2 and 3MB L3).

    Looking at <https://en.wikipedia.org/wiki/Itanium#Design_and_delays:_1994%E2%80%932001>,
    I read:

    |When Merced was floorplanned for the first time in mid-1996, it turned
    |out to be far too large [...]. The designers had to reduce the
    |complexity (and thus performance) of subsystems, including the x86
    |unit and cutting the L2 cache to 96 KB.[d] Eventually it was agreed
    |that the size target could only be reached by using the 180 nm process
    |instead of the intended 250 nm.

    For comparison, in the same 0.18um process the Willamette included
    8+8KB L1 and 256KB L2 cache in 217mm^2, and in the same-sized process
    the AMD Thunderbird and Palomino included 64+64KB L1 and 256KB L2
    cache.

    Unfortunately, the caches dominate the transistor counts, so one
    cannot tell how many transistors were needed for implementing the data
    path and control logic.

    We do have in-order CPUs such as the 4-wide 21164: 9.3M transistors
    (with 8+8KB L1 and 96KB L2), 299mm^2 in a 0.5um process.

    So the OoOness of the PA-8000 may have cost around as much area as the
    caches of the 21164 (and the higher clock rate of the 21164 compared
    to the PA-8000 and the Pentium Pro supported the theory that OoO is
    inherently slower).

    Comparing the 21164 to the Merced, the L2 cache sizes are the same and
    the L1 size of Merced is twice that of the 21164, yet the Merced takes
    2.7x the number of transistors of the 21164, and probably a lot of the additional transistors are not for the additional L1 caches. It seems
    that the architectural features and/or maybe the 6-wide implementation
    of the Merced cost a lot of transistors and thus die area, whereas a
    sales pitch for EPIC was that thanks to the explicit grouping of
    instructions, the supposedly quadratic cost of checking for register-to-register dependences would be eliminated, resulting in
    more area for additional functional units.

    Bottom line: If EPIC is easier to fit on a microprocessor, there is no
    evidence for that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Feb 22 13:17:30 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    I have certainly read about interesting results for binary search (in
    an array) where the branching version outperforms the branchless
    version because the branching version speculatively executes several
    followup loads, and even in unpredictable cases the speculation should
    be right in 50% of the cases, resulting in an effective doubling of
    memory-level parallelism (but at a much higher cost in memory
    subsystem load). But other than that, I don't see that speculative
    execution helps with memory latency.

    At the cost of opening the core up to Spectre-like attacks.

    There may be a way to avoid the side channel while still supporting
    this scenario. But I think that there are better ways to speed up
    such a binary search:

    Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
    prefetches), or four levels (16 prefetches), etc., whatever gives the
    best performance (which may be hardware-dependent). The result is a
    speedup of the binary search by (in the limit) levels+1.
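    A minimal C sketch of that software-prefetched binary search, here
    prefetching just one level ahead (2 prefetches per iteration); deeper
    prefetching repeats the same pattern with 4, 8, 16, ... prefetches.
    It assumes the GCC/Clang `__builtin_prefetch` builtin and requires
    n >= 1; the function name and branchless formulation are mine, not
    from the post:

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless lower-bound search (index of first element >= key) that,
   before deciding which half to descend into, prefetches the probe
   locations of *both* possible next iterations, overlapping their load
   latency with the current comparison. */
size_t lower_bound_prefetch(const int64_t *a, size_t n, int64_t key)
{
    const int64_t *base = a;
    while (n > 1) {
        size_t half = n / 2;
        size_t next = (n - half) / 2;   /* the following iteration's half */
        if (next) {
            __builtin_prefetch(&base[next - 1]);        /* if we stay left */
            __builtin_prefetch(&base[half + next - 1]); /* if we go right  */
        }
        base += (size_t)(base[half - 1] < key) * half;  /* branch-free step */
        n -= half;
    }
    return (size_t)(base - a) + (size_t)(base[0] < key);
}
```

    The prefetches do not change the result, only the timing; prefetching
    k levels ahead issues 2^k prefetches per iteration, so the useful
    depth is bounded by memory bandwidth and prefetch-queue capacity.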

    By contrast, the branch prediction "prefetching" provides a factor 1.5
    at twice the number of loads when one branch is predicted, 1.75 at 3x
    the number of loads when two branches are predicted, etc. up to a
    speedup factor of 2 with an infinite number of loads and predicted branches;
    that's for completely unpredictable lookups, with some predictability
    the branch prediction approach performs better, and with good
    predictability it should outdo the software-prefetching approach for
    the same number of additional memory accesses.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Feb 22 13:37:04 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted:
    The HP side had people like Bob Rau (Cydrome) and Josh Fisher
    (Multiflow), and given their premise, the architecture is ok; somewhat
    on the complex side, but they wanted to cover all the good ideas from
    earlier designs; after all, it was to be the one architecture to rule
    them all (especially performancewise). You cannot leave out a feature
    that a competitor could then add to outperform IA-64.

    In this time period, performance was doubling every 14 months, so if a
    feature added x performance it MUST avoid adding more than x/14 months
    to the schedule. If IA-64 was 2 years earlier, it would have been
    competitive--sadly it was not.

    No, if a feature adds a year in development time, you start a year
    earlier (or alternatively target a release a year later).

    Intel adds features to AMD64 (or "Intel 64", as they call it) all the
    time, usually with little immediate performance impact, but they
    managed to keep their schedules, at least while the process advances
    also kept to their schedules (which broke down around 2016).

    For IA-64, Intel/HP did not add ISA features later, and it still did
    not achieve competitive performance for general-purpose code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Feb 22 17:05:23 2026
    From Newsgroup: comp.arch

    On Sun, 22 Feb 2026 13:17:30 GMT
    [email protected] (Anton Ertl) wrote:
    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted:
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come
    to mind, but are they really common, apart from programming
    classes?

    Array and Matrix scientific codes with datasets bigger than cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive even
    without hardware predictors. The overwhelming majority of data is
    served from the L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.
    I have certainly read about interesting results for binary search
    (in an array) where the branching version outperforms the
    branchless version because the branching version speculatively
    executes several followup loads, and even in unpredictable cases
    the speculation should be right in 50% of the cases, resulting in
    an effective doubling of memory-level parallelism (but at a much
    higher cost in memory subsystem load). But other than that, I
    don't see that speculative execution helps with memory latency.

    At the cost of opening the core up to Spectre-like attacks.

    There may be a way to avoid the side channel while still supporting
    this scenario. But I think that there are better ways to speed up
    such a binary search:

    Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
    prefetches), or four levels (16 prefetches), etc., whatever gives the
    best performance (which may be hardware-dependent). The result is a
    speedup of the binary search by (in the limit) levels+1.

    By contrast, the branch prediction "prefetching" provides a factor 1.5
    at twice the number of loads when one branch is predicted, 1.75 at 3x
    the number of loads when two branches are predicted, etc. up to a
    speedup factor of 2 with an infinite number of loads and predicted branches;
    that's for completely unpredictable lookups, with some predictability
    the branch prediction approach performs better, and with good
    predictability it should outdo the software-prefetching approach for
    the same number of additional memory accesses.

    - anton
    The recent comparisons of branchy vs branchless binary search that we
    carried out on the RWT forum seem to suggest that on modern CPUs the
    branchless variant is faster even when the table does not fit in LLC.
    The branchy variant manages to pull ahead only when TLB misses can't
    be served from the L2$.
    At least that's how I interpreted it.
    Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
    And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Sun Feb 22 11:51:36 2026
    From Newsgroup: comp.arch

    Anton Ertl [2026-02-21 18:41:12] wrote:
    Stefan Monnier <[email protected]> writes:
    MitchAlsup <[email protected]d> wrote:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.
    I can't remember seeing such arguments coming from compiler people, tho.
    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    Of course, compiler people have worked on such problems and solved some
    cases. But what I wrote above is that "I can't remember seeing
    ... compiler people" claiming that "{sooner or later} compilers COULD
    figure stuff like this out".

    The major problem was that the premise was wrong. They assumed that
    in-order would give them a clock rate edge, but that was not the case,
    right from the start (The 1GHz Itanium II (released July 2002)
    competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
    XP (released June 2002)). They also assumed that explicit parallelism
    would provide at least as much ILP as hardware scheduling of OoO CPUs,
    but that was not the case for general-purpose code, and in any case,
    they needed a lot of additional ILP to make up for their clock speed disadvantage.

    Definitely.

    The odd thing is that these were hardware companies betting on "someone
    else" solving their problem, yet if compiler people truly had managed
    to solve those problems, then other hardware companies could have taken
    advantage just as well.
    I am sure they had patents on stuff like the advanced load and the
    ALAT, so no, other hardware companies would have had a hard time.

    I'm pretty sure that if compiler people ever solve the problems that
    plagued the Itanium, those same solutions can bring similar benefits to architectures using other (non-patented) mechanisms.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Feb 22 19:20:51 2026
    From Newsgroup: comp.arch


    John Levine <[email protected]> posted:

    According to Anton Ertl <[email protected]>:
    Stefan Monnier <[email protected]> writes:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    I knew the Multiflow people pretty well when I was at Yale. Trace
    scheduling was inspired by the FPS AP-120B, which had wide
    instructions issuing multiple operations and was extremely hard to
    program efficiently.

    Multiflow's compiler worked pretty well and did a good job of static scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
    used it in their VLIW projects. It was a good match for the hardware
    in the 1980s.

    But as computer hardware got faster and denser, it became possible to
    do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.

    Not comparable; superior.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Sun Feb 22 20:14:10 2026
    From Newsgroup: comp.arch

    According to Stefan Monnier <[email protected]>:
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    Of course, compiler people have worked on such problems and solved some
    cases. But what I wrote above is that "I can't remember seeing
    ... compiler people" claiming that "{sooner or later} compilers COULD
    figure stuff like this out".

    I recall Multiflow people telling me that trace scheduling did a great job
    of scheduling memory accesses when the patterns were predictable and that it didn't when they weren't, i.e., data dependent.

    Apropos another thread, I can believe that IA-64 was obsolete before it
    was shipped for that reason: static scheduling will never keep up with
    dynamic except in applications where the access patterns are predictable.

    Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Sun Feb 22 21:52:00 2026
    From Newsgroup: comp.arch

    In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:

    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

    Educators who are any good notice the weird kids who are actually smart.

    For 128 predicate registers, this part doesn't make as much sense:

    I suspect they wanted to re-use some logic.

    The tricks Itanium could do with combinations of predicate registers were pretty weird. There was at least one instruction for manipulating them
    which I was entirely unable to understand, with the manual in front of me
    and pencil and paper to try examples. Fortunately, it never occurred in
    code generated by any of the compilers I used.

    *1: Where people argue that if each vendor can do a CPU with their
    own custom ISA variants and without needing to license or get
    approval from a central authority, that invariably everything would
    decay into an incoherent mess where there is no binary
    compatibility between processors from different vendors (usual
    implication being that people are then better off staying within
    the ARM ecosystem to avoid RV's lawlessness).

    The importance of binary compatibility is very much dependent on the
    market sector you're addressing. It's absolutely vital for consumer apps
    and games. It's much less important for current "AI" where each vendor
    has their own software stack anyway. RISC-V seems to be far more
    interested in the latter at present.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Feb 22 23:08:05 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
    for that reason, static scheduling will never keep up with dynamic except in
    applications where the access patterns are predictable.

    Concerning the scheduling, hardware scheduling looked pretty dumb at
    the time (always schedule the oldest ready instruction(s)), and
    compilers could pick the instructions on the critical path in their
    scheduling, but given the scheduling barriers in compilers (e.g.,
    calls and returns), and the window sizes in current hardware, even
    dumb is superior to smart.

    Another aspect where hardware is far superior is branch prediction.

    Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?

    There have certainly been DSPs from TI (C60 series IIRC) and Philips
    (TriMedia) that have VLIW architectures, so at least for a while, VLIW
    was competitive there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Mon Feb 23 01:32:52 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <[email protected]>:
    John Levine <[email protected]> writes:
    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
    for that reason, static scheduling will never keep up with dynamic except in
    applications where the access patterns are predictable.

    Concerning the scheduling, hardware scheduling looked pretty dumb at
    the time (always schedule the oldest ready instruction(s)), ...

    I was thinking of memory scheduling. You have multiple banks of memory
    each of which can only do one fetch or store at a time, and the goal
    was to keep all of the banks as busy as possible. If you're accessing
    an array in predictable order, trace scheduling works well, but if
    you're fetching a[b[i]] where b varies at runtime, it doesn't.

    Another aspect where hardware is far superior is branch prediction.

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details, e.g.,
    if there's a trap on a path that turns out not to be taken.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon Feb 23 06:55:08 2026
    From Newsgroup: comp.arch

    John Levine <[email protected]> writes:
    According to Anton Ertl <[email protected]>:
    John Levine <[email protected]> writes:
    Apropos another thread I can believe that IA-64 was obsolete before it was shipped
    for that reason, static scheduling will never keep up with dynamic except in
    applications where the access patterns are predictable.

    Concerning the scheduling, hardware scheduling looked pretty dumb at
    the time (always schedule the oldest ready instruction(s)), ...

    I was thinking of memory scheduling. You have multiple banks of memory
    each of which can only do one fetch or store at a time, and the goal
    was to keep all of the banks as busy as possible. If you're accessing
    an array in predictable order, trace scheduling works well, but if
    you're fetching a[b[i]] where b varies at runtime, it doesn't.

    More advanced in-order uarchs have dealt with this by submitting
    requests to the load-store unit in-order, performing the requests as
    the resources and the memory model allow, and only letting uses of the
    loads wait for the results (Mitch Alsup has been calling such things
    OoO, too, but that's far from the modern OoO with speculative
    execution and in-order completion; in particular, there is no
    speculative execution).

    Plus, there is the option of inserting prefetch instructions, which
    give the same memory-level parallelism on in-order CPUs as on OoO
    CPUs; for a[b[i]] patterns they might be useful, unless the hardware prefetchers also know how to prefetch that.

    Moreover, AVX-512 has a gathering load instruction (and a scattering
    store instruction) that were designed for an optimized in-order
    implementation in Knights Ferry.

    Another aspect where hardware is far superior is branch prediction.

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy.

    If one goes that way, if-conversion (conversion to predicated
    execution) looked like the way to go, but it turns a control
    dependency (which vanishes if the branch is predicted correctly) into
    a data dependency. Even without that, pulling instructions from all
    ways up across branches soon runs into resource limitations. Static
    branch prediction is not great (~20% mispredicts when using
    heuristics, ~10% when using profile feedback), but it is still far
    better than assuming that all branches go both ways equally likely.
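    If-conversion in a nutshell, as a toy C example (not IA-64 code; both
    function names are hypothetical): the branch becomes a select, so
    there is nothing to mispredict, but the control dependency turns into
    a data dependency on both arms, which must now always be evaluated.

```c
#include <stdint.h>

/* Control dependency: nearly free when the branch predicts well,
   a pipeline flush when it does not. */
int32_t with_branch(int cond, int32_t a, int32_t b)
{
    if (cond)
        return a;
    return b;
}

/* If-converted: no branch to mispredict, but the result depends on
   all three inputs, lengthening the dependency chain. */
int32_t if_converted(int cond, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(cond != 0);   /* all ones when cond holds */
    return (a & mask) | (b & ~mask);        /* both arms, one selected */
}
```

    Which form wins depends on how predictable `cond` is, which is the
    tension the paragraph above describes.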

    Augustus K. Uht suggested that one keeps track of the likelihood of
    statically not-predicted branches; after following a few predictions,
    the likelihood of the path to the currently predicted branch would be
    smaller than the likelihood of one of the earlier not-predicted
    branches. He suggested that the compiler should use these likelihoods
    to guide speculation decisions.

    But in the early 1990s dynamic (hardware) branch prediction started
    producing significantly lower misprediction rates than static and even semi-static branch prediction, and branch predictors have improved
    since then; e.g., for our LaTeX benchmark Zen 4 produces the following
    results:

    1_325_396_218 cycles 4.961 GHz ( +- 0.12% )
    3_565_310_588 instructions 2.69 insn per cycle
    656_903_470 branches 2.459 G/sec ( +- 0.01% )
    8_417_229 branch-misses 1.28% of all branches ( +- 0.21% )

    That's not just 1.28% branch mispredictions, but also 2.36 branch
    mispredictions per 1000 instructions (MPKI), or 423 instructions
    between branch mispredictions on average. This means that the
    hardware scheduler can schedule across hundreds of instructions before
    hitting a scheduling barrier. It also means that all the speculation
    within those 423 instructions will actually be useful, whereas
    speculating on both sides of a branch (or if-conversion) guarantees
    that most of the speculated instructions will be useless.
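    The arithmetic behind those figures, re-derived from the raw perf
    counters quoted above (the struct and function names are mine, for
    illustration only):

```c
/* Derive misprediction metrics from raw perf counters:
   miss rate as % of branches, MPKI, and the average number of
   instructions between mispredictions. */
struct mispredict_stats { double miss_rate_pct, mpki, insns_per_miss; };

struct mispredict_stats derive(double insns, double branches,
                               double misses)
{
    struct mispredict_stats s;
    s.miss_rate_pct  = misses / branches * 100.0; /* ~1.28% of branches */
    s.mpki           = misses / insns * 1000.0;   /* ~2.36 per kilo-insn */
    s.insns_per_miss = insns / misses;            /* ~423 insns of headroom */
    return s;
}
```

    Plugging in the Zen 4 LaTeX-benchmark counters (3,565,310,588
    instructions, 656,903,470 branches, 8,417,229 misses) reproduces the
    1.28%, 2.36 MPKI, and 423-instruction figures in the text.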

    But yes, for instructions from behind an if that do not depend on the
    values computed in the if, one can schedule them above the if without duplication along both paths, and this looked like one of the
    smartness advantages of compilers over OoO hardware. But given the
    very high branch prediction accuracies that hardware exhibits, this
    advantage is small and, as IA-64 demonstrated, comes nowhere near
    outweighing the disadvantages of EPIC.

    There were certainly ugly details, e.g.,
    if there's a trap on a path that turns out not to be taken.

    IA-64 (and EPIC in general) has the advanced load for that; IA-64 also
    does not implement division in hardware (i.e., division by zero is
    checked by software). This means that any trapping can happen once
    the branch is guaranteed to be taken, but the speculative execution
    happens earlier. If static branch prediction worked as well as
    dynamic branch prediction, that would have been one of the important
    parts in making IA-64 have as much ILP as OoO.

    I have actually written a paper on the topic:

    @InProceedings{ertl&krall94,
    author = "M. Anton Ertl and Andreas Krall",
    title = "Delayed Exceptions --- Speculative Execution of
    Trapping Instructions",
    booktitle = "Compiler Construction (CC '94)",
    year = "1994",
    publisher = "Springer LNCS~786",
    address = "Edinburgh",
    month = "April",
    pages = "158--171",
    url = "https://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.gz",
    abstract = "Superscalar processors, which execute basic blocks
    sequentially, cannot use much instruction level
    parallelism. Speculative execution has been proposed
    to execute basic blocks in parallel. A pure software
    approach suffers from low performance, because
    exception-generating instructions cannot be executed
    speculatively. We propose delayed exceptions, a
    combination of hardware and compiler extensions that
    can provide high performance and correct exception
    handling in compiler-based speculative execution.
    Delayed exceptions exploit the fact that exceptions
    are rare. The compiler assumes the typical case (no
    exceptions), schedules the code accordingly, and
    inserts run-time checks and fix-up code that ensure
    correct execution when exceptions do happen."
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon Feb 23 08:06:20 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    [email protected] (Anton Ertl) wrote:

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted: =20
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come
    to mind, but are they really common, apart from programming
    classes? =20

    Array and Matrix scientific codes with datasets bigger than cache. =20 >>=20
    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".
    =20

    BLAS Level 3 is not particularly external/LLC bandwidth intensive even
    without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>). Of course for
    the memory-bound applications the uarch style is not particularly
    important, as long as it allows exploiting the memory-level
    parallelism in some way; prefetching (by hardware or software) is a
    way that can be performed by any kind of uarch to exploit MLP.

    IA-64 implementations have performed relatively well for SPECfp. I
    don't know how memory-bound these applications were, though.

    For those applications that benefit from caches (e.g., matrix
    multiplication), memory-level parallelism is less important.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.

    The recent comparisons of branchy vs branchless binary search that we
    carried out on the RWT forum seem to suggest that on modern CPUs the
    branchless variant is faster even when the table does not fit in LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse
    performance eventually.

    2) (doubtful) The compiler recognizes the algorithm and inserts
    software prefetch instructions.

    Branchy variant manages to pull ahead only when TLB misses can't be
    served from L2$.
    At least that's how I interpreted it.

    Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974

    And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
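    For concreteness, the two variants being compared can be sketched roughly as follows. This is a generic lower-bound-style sketch, not the actual uut2.c; whether the compiler really emits a conditional move for the branchless step has to be verified, as the clang/gcc experience below shows:

```c
#include <stddef.h>

/* Branchy: classic binary search.  Each probe ends in a data-dependent,
 * essentially unpredictable branch, but the core can speculate down one
 * path and start the next load early. */
long branchy_search(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && a[lo] == key) ? (long)lo : -1;
}

/* Branchless: the half-size step is selected with a conditional move,
 * so there are no mispredictions, but each load's address depends on
 * the previous load, serializing the memory accesses. */
long branchless_search(const int *a, size_t n, int key)
{
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base = (base[half - 1] < key) ? base + half : base; /* cmov */
        len -= half;
    }
    return (len && *base == key) ? (long)(base - a) : -1;
}
```

    The tradeoff the thread is probing: mispredicted branches versus serialized (but speculation-free) dependent loads, with the winner depending on where the table lives in the memory hierarchy.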

    I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt. Here I see that branchy is faster for the
    100M array size on Zen4 (i.e., where on the M3 branchless is faster):

    [~/binary-search:165562] perf stat branchless 100000000 10000000 10
    4041.093082 msec. 0.404109 usec/point
    Performance counter stats for 'branchless 100000000 10000000 10':

    45436.88 msec task-clock 1.000 CPUs utilized
    276 context-switches 6.074 /sec
    0 cpu-migrations 0.000 /sec
    1307 page-faults 28.765 /sec
    226091936865 cycles 4.976 GHz
    1171349822 stalled-cycles-frontend 0.52% frontend cycles idle
    34666912723 instructions 0.15 insn per cycle
    0.03 stalled cycles per insn
    5829461905 branches 128.298 M/sec
    45963810 branch-misses 0.79% of all branches

    45.439541729 seconds time elapsed

    45.397207000 seconds user
    0.040008000 seconds sys


    [~/binary-search:165563] perf stat branchy 100000000 10000000 10
    3051.269998 msec. 0.305127 usec/point
    Performance counter stats for 'branchy 100000000 10000000 10':

    34308.48 msec task-clock 1.000 CPUs utilized
    229 context-switches 6.675 /sec
    0 cpu-migrations 0.000 /sec
    1307 page-faults 38.096 /sec
    172472673652 cycles 5.027 GHz
    49176146462 stalled-cycles-frontend 28.51% frontend cycles idle
    33346311766 instructions 0.19 insn per cycle
    1.47 stalled cycles per insn
    10322173655 branches 300.864 M/sec
    1583842955 branch-misses 15.34% of all branches

    34.311432848 seconds time elapsed

    34.264437000 seconds user
    0.044005000 seconds sys

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Mon Feb 23 13:03:26 2026
    From Newsgroup: comp.arch

    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    [email protected] (Anton Ertl) wrote:

    MitchAlsup <[email protected]d> writes:

    [email protected] (Anton Ertl) posted: =20
    [1) stride-based. 2) pointer-chasing]
    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays
    come to mind, but are they really common, apart from programming
    classes? =20

    Array and Matrix scientific codes with datasets bigger than
    cache. =20
    =20
    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".
    =20

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>).

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
    decomposition of matrix of similar dimensions.
    I don't consider anything in BLAS Level 2 as "dense".
    Maybe my definitions of terms are unusual.

    Of course for
    the memory-bound applications the uarch style is not particularly
    important, as long as it allows exploiting the memory-level
    parallelism in some way; prefetching (by hardware or software) is a
    way that any kind of uarch can use to exploit MLP.

    IA-64 implementations have performed relatively well for SPECfp. I
    don't know how memory-bound these applications were, though.


    SPECfp2006 was mostly not memory-bound on CPUs of that era.
    OTOH, SPECfp_rate2006 for n>4 was rather heavily memory-bound.

    For those applications that benefit from caches (e.g., matrix multiplication), memory-level parallelism is less important.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.


    It does not slow down the execution. On the contrary, it speeds it up
    so much that the speed of handling L2 misses begins to matter.
    Note that this is just my speculation, which may be wrong.
    IIRC, right now binary64-capable AMX is available only on Apple Silicon
    (via SME) and maybe on IBM Z. I didn't play with either.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Mon Feb 23 13:44:05 2026
    From Newsgroup: comp.arch

    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:


    The recent comparisons of branchy vs branchless binary search that we
    carried out on the RWT forum seem to suggest that on modern CPUs the
    branchless variant is faster even when the table does not fit in LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse performance eventually.


    Coffee Lake certainly has no such prefetcher and nevertheless exhibits
    similar behavior.

    2) (doubtful) The compiler recognizes the algorithm and inserts
    software prefetch instructions.

    Branchy variant manages to pull ahead only when TLB misses can't be
    served from L2$.
    At least that's how I interpreted it.

    Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974

    And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895


    I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt.

    I suppose that you have a good reason for avoiding installation of
    clang-17 or later on one of your computers.

    Here I see that branchy is faster for the
    100M array size on Zen4 (i.e., where on the M3 branchless is faster):

    [~/binary-search:165562] perf stat branchless 100000000 10000000 10
    4041.093082 msec. 0.404109 usec/point
    Performance counter stats for 'branchless 100000000 10000000 10':

    45436.88 msec task-clock 1.000 CPUs utilized
    276 context-switches 6.074 /sec
    0 cpu-migrations 0.000 /sec
    1307 page-faults 28.765 /sec
    226091936865 cycles 4.976 GHz
    1171349822 stalled-cycles-frontend 0.52% frontend cycles idle
    34666912723 instructions 0.15 insn per cycle
    0.03 stalled cycles per insn
    5829461905 branches 128.298 M/sec
    45963810 branch-misses 0.79% of all branches


    45.439541729 seconds time elapsed

    45.397207000 seconds user
    0.040008000 seconds sys


    [~/binary-search:165563] perf stat branchy 100000000 10000000 10
    3051.269998 msec. 0.305127 usec/point
    Performance counter stats for 'branchy 100000000 10000000 10':

    34308.48 msec task-clock 1.000 CPUs utilized
    229 context-switches 6.675 /sec
    0 cpu-migrations 0.000 /sec
    1307 page-faults 38.096 /sec
    172472673652 cycles 5.027 GHz
    49176146462 stalled-cycles-frontend 28.51% frontend cycles idle
    33346311766 instructions 0.19 insn per cycle
    1.47 stalled cycles per insn
    10322173655 branches 300.864 M/sec
    1583842955 branch-misses 15.34% of all branches


    34.311432848 seconds time elapsed

    34.264437000 seconds user
    0.044005000 seconds sys

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


    Run it for every size from 100K to 2G in increments of x sqrt(2).
    BTW, I prefer an odd number of iterations.
    The point at which the majority of look-ups miss L3$ at least once
    will hopefully show up as a change of slope on the log(N) vs.
    duration graph for the branchless variant.
    I would not bother with performance counters. At least for me they
    bring more confusion than insight.

    According to my understanding, on Zen4 the ratio of main DRAM
    latency to L3 latency is much higher than on either Coffee Lake or M3,
    both of which have unified LLC instead of split L3$.
    So, if on Zen2 branchy starts to win at ~2x L3 size, I will not be
    shocked. But I will be somewhat surprised.

    I actually have access to a Zen3-based EPYC, where the above-mentioned
    ratio is supposedly much bigger than on any competent client CPU
    (Intel's Lunar/Arrow Lake do not belong to this category), but this
    server is currently powered down and it's a bit of a hassle to turn
    it on.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Feb 23 10:14:00 2026
    From Newsgroup: comp.arch

    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency-bound and bandwidth-bound. An example of each:

    Many occur in commercial (i.e. non-scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say, 300-byte) record for each of its many thousands of
    employees, each containing typical employee stuff. Now suppose someone
    wants to know "What is the total salary of all the employees in the
    Sales department?" With no index on "department", but with that field
    at a fixed displacement within each record, the code looks at each
    record, does a trivial test on it, perhaps adds to a register, then
    goes to the next record. This is almost certainly memory-latency bound.
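    The scan described above might be sketched as follows; the record layout, field names, and offsets are made up for illustration, and whether the loop ends up latency-bound depends on whether the records are scattered across pages or laid out contiguously and prefetchable:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 300-byte employee record; "department" and "salary"
 * sit at fixed displacements, as in the example above. */
enum { RECORD_SIZE = 300 };

struct employee {
    char     name[64];
    char     department[16];  /* fixed displacement within the record */
    uint32_t salary;
    char     other[RECORD_SIZE - 64 - 16 - 4];
};
_Static_assert(sizeof(struct employee) == RECORD_SIZE, "layout");

/* The per-record work (one strcmp, one add) is trivial; when each
 * record access misses, the loop's time is dominated by miss latency
 * rather than by compute or raw bandwidth. */
uint64_t total_sales_salary(const struct employee *recs, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(recs[i].department, "Sales") == 0)
            total += recs[i].salary;
    return total;
}
```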

    For a memory-bandwidth-bound example, consider comparing two large text documents for equality.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Mon Feb 23 21:22:00 2026
    From Newsgroup: comp.arch

    In article <10ngao4$d5o$[email protected]>, [email protected] (John Levine)
    wrote:

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details,
    e.g., if there's a trap on a path that turns out not to be taken.

    Found a good CPU bug like that on an old AMD chip, the K6-II.

    It happened with a floating point divide by zero in the x87 registers,
    guarded by a test for division by zero, with floating-point traps enabled.
    The divide got speculatively executed, the trap was stored, the test
    revealed the divide would be by zero, the CPU tried to clean up, hit its
    bug, and just stopped. Power switch time.

    This only happened with the reverse divide instruction, which took the
    operands off the x87 stack in the opposite order from the usual FDIV. It
    was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.

    John
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Mon Feb 23 21:33:51 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

    [email protected] (Anton Ertl) posted:

    IA-64 was to prevent AMD (and others) from clones--that is all. Intel/HP would have had a patent wall 10 feet tall.

    It did not fail by being insufficiently protected, it failed because it
    did not perform.

    What did they (try to) patent?

    -----------------
    For a few cases, they have. But the problem is that these cases are
    also vectorizable.

    Their major problem was that they did not get enough ILP from
    general-purpose code.

    And even where they had enough ILP, OoO CPUs with SIMD extensions ate
    their lunch.

    And that should have been the end of the story. ... Sorry Ivan.

    Things have been very quiet there. The last post in their
    forum talks about the Mill trying to get to the same level as
    Intel/AMD with Coremark, and they say they are going to go to
    other microbenchmarks, where the Mill is supposed to be better.

    I'm not sure...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Mon Feb 23 17:17:59 2026
    From Newsgroup: comp.arch

    On 2/18/26 3:45 PM, BGB wrote:
    On 2/16/2026 3:14 PM, Paul Clayton wrote:
    On 11/5/25 2:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    [email protected] (Anton Ertl) posted:
    [snip]
    Branch prediction is fun.

    When I looked around online before, a lot of stuff about
    branch prediction was talking about fairly large and
    convoluted schemes for the branch predictors.

    You might be interested in looking at the 6th Championship
    Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-
    championship-branch-prediction-cbp2025/


    Quick look, didn't see much information about who entered or won...

    This presentation (PDF) presents the winners: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/CBP2025-Closing-Remarks.pdf

    1st Place: Toru Koizumi et al.'s "RUNLTS: Register-value-aware
    predictor Utilizing Nested Large TableS"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final44-Koizumi.pdf


    2nd Place: André Seznec's "TAGE-SC for CBP2025"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final37-Seznec.pdf

    3rd Place: Yang Man et al.'s "LVCP: A Load-Value Correlated
    Predictor for TAGE-SC-L"
    PDF of paper: https://ericrotenberg.wordpress.ncsu.edu/files/2025/06/cbp2025-final15-Man.pdf

    The program lists the competitors and provides links to papers,
    presentations, video, and code: https://ericrotenberg.wordpress.ncsu.edu/cbp2025-workshop-program/

    I probably should have provided the link to the program to save
    any readers time and effort. (I had not remembered how many link
    traversals were required and just picked a url from my Firefox
    history.)

    (One nice aspect of this workshop/competition is that one can
    reproduce the results and try one's own designs. As long as one
    can trust that the traces are representative of the workloads
    one cares about, this seems significant.)

    TAgged GEometric length predictors (TAGE) seem to be the current
    "hotness" for branch predictors. These record very long global
    histories and fold them into shorter indexes, with the number of
    history bits used varying across the different tables.

    (Because the correlation is less strong, 3-bit counters are
    generally used as well as a useful bit.)


    When I messed with it, increasing the strength of the saturating
    counters was not effective.

    But, increasing the ability of them to predict more complex
    patterns did help.

    Branch prediction is heuristic and the tradeoffs change based on
    the workload and the storage (and latency) budget.

    Saturating counters record localized bias not a pattern, so
    fuzzy behavior might be more accurately predicted on average
    while (as you noted) repeated patterns are less likely to be
    predicted accurately.

    Larger counters can even out some irregularity in the bias but
    increase training time. (With tags, a new entry can be
    initialized to a decent state, which can reduce training time.)
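    For reference, the 2-bit saturating counter update under discussion is simply:

```c
/* Classic 2-bit saturating counter: states 0-1 predict not-taken,
 * states 2-3 predict taken.  Each outcome moves the counter one step,
 * so a single anomalous outcome does not flip a strong bias -- which
 * is exactly the "records bias, not pattern" behavior noted above. */
static unsigned ctr_update(unsigned ctr, int taken)
{
    if (taken)
        return ctr < 3 ? ctr + 1 : 3;   /* saturate at strongly taken */
    return ctr > 0 ? ctr - 1 : 0;       /* saturate at strongly not-taken */
}

static int ctr_predict(unsigned ctr)
{
    return ctr >= 2;   /* 1 = predict taken */
}
```

    A wider counter generalizes this by moving the flip threshold further from the saturation points, smoothing more irregularity at the cost of longer training, as the paragraph above notes.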

    For per-address prediction, there are ways of reducing the
    overhead for tagging each entry. If a branch target buffer (BTB)
    is used, one is already storing a target address (or at least an
    offset/cache index) so adding some tag bits is not quite as
    expensive and allows associativity; placing a per-address
    prediction in a BTB was a common design. Per-address predictions
    can also be associated with the Icache, in which case they can
    reuse the Icache tagging. Tagging reduces aliases, but obviously
    bits spent on tags cannot be spent on prediction entries
    themselves (fewer entries means evictions that reduce training
    time).

    Aliasing is a major issue even with per-address prediction; with
    global branch history aliasing is even more likely. Several
    techniques have been proposed (besides tagging) to reduce
    aliasing such as multiple diversely indexed predictors with a
    majority vote. (The never manufactured Alpha 21464 design used
    three differently indexed tables to produce a majority vote and
    a fourth table to choose whether to use the majority vote or the
    prediction from a specific table. See "Design Tradeoffs for the
    Alpha EV8 Conditional Branch Predictor" for details.)

    (Mitch Alsup would, of course, mention his agree predictor which
    uses a second source such as a per-address prediction or a
    static prediction to exploit that predictions are likely to
    agree with their local prediction and two branches that use the
    same index could have different direction biases. In addition,
    if the global predictor is used as a complement to a per-address
    predictor, different branches with the same global index may be
    likely to disagree with a per-address predictor.)

    Capacity issues can be so significant that for some workloads
    single bit predictors outperform two-bit counters. Predicting
    more static (per-address) branches somewhat accurately can be
    more important than predicting fewer branches more accurately.

    Counters also have the minor advantage of providing a confidence
    estimate that can be used for things like prediction selection,
    checkpoint selection (doing a renaming snapshot to speed
    misprediction recovery), dynamic predication, prefetch
    throttling, or execution throttling (to save power/energy).

    But, then always at the end of it using 2-bit saturating
    counters:
       weakly taken, weakly not-taken, strongly taken, strongly
    not taken.

    But, in my fiddling, there was seemingly a simple but
    moderately effective strategy:
       Keep a local history of taken/not-taken;
       XOR this with the low-order-bits of PC for the table index;
       Use a 5/6-bit finite-state-machine or similar.
         Can model repeating patterns up to ~ 4 bits.

    Indexing a predictor by _local_ (i.e., per instruction address)
    history adds a level of indirection; once one has the branch
    (fetch) address one needs to index the local history and then
    use that to index the predictor.
    [snip]

    OK, seems I wrote it wrong:
    I was using a global branch history.

    The terminology can be tricky. The worst is probably the
    difference between bimode and bimodal; the former is similar
    to agree prediction in attempting to separate bias, the latter
    is a name for per-address prediction (two modes, taken or not
    taken).

    But, either way, the history is XOR'ed with the relevant bits
    from PC to generate the index.

    This is called a gshare predictor, as distinguished from a
    gselect predictor, which concatenates global history and address
    bits to form the index.
    (There is also a distinction of path history — address bits
    rather than taken/not-taken — and global branch history.) As you
    probably discovered in your experimentation, hashing in the
    address bits provides significantly better prediction.
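    The indexing schemes being contrasted can be sketched as below; the table size and the gselect bit split are arbitrary illustrative choices (in McFarling's formulation gselect concatenates PC and history bits, while gshare XORs them over the full index width):

```c
#include <stdint.h>

enum { TABLE_BITS = 12, TABLE_SIZE = 1u << TABLE_BITS };

/* gshare: XOR global history with low PC bits over the full index,
 * spreading branches with identical histories across the table. */
static uint32_t gshare_index(uint32_t pc, uint32_t ghist)
{
    return (pc ^ ghist) & (TABLE_SIZE - 1);
}

/* gselect: concatenate some PC bits with some history bits
 * (here 6 + 6, an arbitrary split). */
static uint32_t gselect_index(uint32_t pc, uint32_t ghist)
{
    return ((pc & 0x3f) << 6) | (ghist & 0x3f);
}

/* After each resolved branch, the outcome shifts into the history. */
static uint32_t ghist_update(uint32_t ghist, int taken)
{
    return ((ghist << 1) | (taken & 1)) & (TABLE_SIZE - 1);
}
```

    The XOR lets all index bits carry both address and history information, which is why hashing in the address bits predicts noticeably better than history bits alone.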

    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit
    saturating counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).

    As noted, the 5/6-bit FSM can predict arbitrary 4-bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).

    With the PC XOR Hist, lookup, there were still quite a few
    patterns, and not just contiguous 0s or 1s that the saturating
    counter would predict, nor just all 0s, all 1s, or ...101010...
    that the 3-bit FSM could deal with.

    But, a bigger history could mean fewer patterns and more
    contiguous bits in the state.

    TAGE has the advantage that the tags reduce branch aliases and
    the variable history length (with history folding/compression)
    allows using less storage (when a prediction only benefits from
    a shorter history) and reduces training time.
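    The history folding mentioned here can be sketched as below. This is a simplified, non-incremental version (real TAGE implementations update the folded value incrementally as outcomes shift in, rather than re-scanning the history); the 64-bit history and 10-bit index width are illustrative:

```c
#include <stdint.h>

enum { IDX_BITS = 10 };

/* Fold a long global history into an IDX_BITS-wide value by XORing
 * successive IDX_BITS-sized chunks.  Different TAGE tables fold
 * different history lengths (hist_len) into the same index width. */
static uint32_t fold_history(uint64_t hist, int hist_len)
{
    uint32_t folded = 0;
    for (int i = 0; i < hist_len; i += IDX_BITS)
        folded ^= (uint32_t)(hist >> i) & ((1u << IDX_BITS) - 1);
    return folded;
}
```

    Folding is what lets a table "see" a very long history while keeping its index (and storage) small; only when two histories collide after folding does aliasing occur, which the per-entry tags then catch.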

    In my case, was using 6-bit lookup mostly to fit into LUT6 based
    LUTRAM.

    Going bigger than 6 bits here is a pain point for FPGAs, more so
    as BRAMs don't support narrow lookups, so the next size up
    would likely be 2048x but then inefficiently using the BRAM
    (there isn't likely a particularly good way to make use of 512x
    18-bits).

    Presumably selecting specific bits of the 18 (shifting/alignment
    network) is expensive in an FPGA? With more bits per index, it
    might also make sense to use associativity with partial tags. It
    might be practical with two-way associativity to use different
    numbers of tag bits (with different address hashings) in
    different ways and different state sizes.

    [snip]
    Local history patterns may also be less common than statistical
    correlation after one has extracted branches predicted well by
    global history. (For small-bodied loops, a moderately long
    global history provides substantial local history.)

    It seems what I wrote originally was inaccurate, I don't store a
    history per-target, merely it was recent taken/not-taken branches.

    Well, I mixed up way and set in one of my comp.arch postings,
    and this was after (if I recall correctly) having read Computer
    Architecture: A Quantitative Approach as well as more than a
    couple of papers and web articles and comp.arch postings.

    But, I no longer remember what I was thinking at the time, or
    why I had written local history rather than global history
    (unless I meant "local" in terms of recency or something, I
    don't know).

    "Local" is not really a good term for per-address because of the
    potential confusion with temporal locality and the generality (a
    local branch prediction is not about predicting branches with
    nearby addresses — I am skeptical that there is much "spatial
    locality" for branch direction).

    (Terminology in computer architecture is not very well
    organized, consistent, or insightful. I feel "Instruction
    Pointer" is a better term than "Program Counter" and
    "Translation Cache" is a better term than "Translation Lookaside
    Buffer"; the former is still strong because of x86, but the
    latter had Itanium as a "backer"☹. In terms of historical
    precedence, "elbow cache" should be preferred for a cache that
    uses different indexing methods to find an alternative index for
    an evicted block rather than "cuckoo cache" which seems more
    evocative and was used for hash tables. Then there is "block"
    (IBM) versus "line" (Intel) and whether a "sector" is larger or
    smaller than a line/block and what pipeline stage is "issue" and
    which "dispatch".)

    [snip]

    A lot of this seems a lot more complex though than what would be
    all that practical on a Spartan or Artix class FPGA.

    Modern high-performance branch predictors are huge! The branch
    prediction section of AMD's Zen 4 is larger than the instruction
    cache, and close to the size of (possibly larger than) the
    instruction cache and decode sections combined: https://substackcdn.com/image/fetch/$s_!Ak84!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d037035-aad5-4ff5-931b-c51f0409e9f3_787x378.jpeg
    (from the Chips & Cheese article "Zen 5’s 2-Ahead Branch
    Predictor Unit: How a 30 Year Old Idea Allows for New Tricks" https://chipsandcheese.com/p/zen-5s-2-ahead-branch-predictor-unit-how-30-year-old-idea-allows-for-new-tricks
    )

    I was mostly using 5/6 bit state machines as they gave better
    results than 2-bit saturating counters, and fit nicely within
    the constraints of a "history XOR PC" lookup pattern.

    I think it is very neat that you were experimenting with such.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@[email protected] to comp.arch on Sat Feb 21 23:51:32 2026
    From Newsgroup: comp.arch

    On 2/19/26 6:10 PM, John Dallman wrote:
    [snip]
    On 2/19/26 6:04 PM, BGB wrote:
    In a way, it showed that they screwed up the design pretty hard
    that x86-64 ended up being the faster and more efficient option...

    They did. They really did.

    I guess one question is if they had any other particular drawbacks
    other than, say:
    Their code density was one of the worst around;
    128 registers is a little excessive;
    128 predicate register bits is a bit WTF;

    Those huge register files had a lot to do with the low code density. They
    had two much bigger problems, though.

    The large register count seems to have been a result of both the
    integer-side register stack idea — if one is going to lazily
    save registers one might as well be able to use those registers
    within a single function — and the rejection of pipeline-
    dependent binaries where a result does not use a register name
    until the result is produced (Itanium code was defined to be
    executable by single stepping through operations).

    The hidden pipeline architectural choice also seems to have
    meant that operations using the results of non-single-cycle
    operations have to check for availability. Since the register
    number does not communicate the latency, all register sources
    would have to check a scoreboard-like structure (I think).

    (The Mill at least recognized that compiling a distribution
    format to work on the specific pipeline makes sense for static
    scheduling.)

    Loop unrolling combined with software pipelining presumably also
    motivated larger register files.

    The lack of base+immediate addressing — to avoid address
    generation latency when not necessary — also hurt code density
    even though one could sometimes use post-increment addressing to
    generate a later-used address. This probably also tended to
    increase register use as two parallel address computations would
    have to have different destination registers.

    The template system also seems to have bloated the code,
    introducing unnecessary nops. This is also a consequence of not
    exposing the pipeline; with an exposed microarchitecture the
    encoding of instruction routing could be simplified.

    They'd correctly understood that the low speed of affordable dynamic RAM
    as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
    plenty of time to think could schedule loads better than hardware doing
    it dynamically. It's an appealing idea, but it's wrong.

    They had 'evidence'. From the Oral history of Robert P. Colwell:

    | He said, well that's true we don't have a compiler yet, so I
    | hand assembled my simulations. I asked "How did you do
    | thousands of line of code that way?" He said “No, I did 30
    | lines of code”. Flabbergasted, I said, "You're predicting the
    | entire future of this architecture on 30 lines of hand
    | generated code?"

    That oral history has a lot of pieces that pointed to major
    organizational issues (e.g., the two people with VLIW experience
    at Intel were not involved in the project, not even for initial
    consultation).

    One implementation flaw (in my opinion) for Itanium 2 was the
    provisioning of register file ports for peak demand. With in-order
    execution and a
    decent compiler there should be (I suspect/feel) very few cases
    where ILP is hurt by substantially reducing the register file
    port count (relying on forwarding, immediates, etc. to reduce
    demand for register reads). For the FP-side, the architecture
    already implied two-wide banking for load-pair which were
    required to be even/odd pairs even with rotation. Even without
    expensive optimization, register file banking might have reduced
    port demand for such wide execution without much if any ILP
    loss.

    (I admit my feeling about register file ports is just a feeling,
    but it would have been fairly easy to test whether much existing
    code suffered lower ILP from having, e.g., only 8 read ports
    instead of 12.)

    I suspect an exposed pipeline architecture might also allow a
    compiler to schedule result forwarding such that a full all-to-
    all network might be avoided.

    It might be possible to do that effectively in a single-core,
    single-thread, single-task system that isn't taking many (if any)
    interrupts. In a multi-core system running a complex operating system,
    several multi-threaded applications, and taking frequent interrupts
    and context switches, it is _not possible_. There is no knowledge of
    any of the interrupts, context switches or other applications at
    compile time, so the compiler has no idea what is in cache and what
    isn't. I don't understand why HP and Intel didn't realise this. It
    took me years, but I am no CPU designer.

    For simple interrupts, the 16 "banked" GPRs (similar to ARM's
    fast interrupt limited register set) might have been enough to
    avoid having to save the context for most interrupts. I would
    have guessed that 32 GPRs, to match the static (not
    rotating/stacked) GPRs would have been better, particularly with
    the 3D register file mechanism to hide the area cost under the
    wires of a highly ported register file, a mechanism that was used
    for the multithreaded Itanium. On the other hand, to use so many
    registers for interrupt handling, it might have been necessary
    to provide a static-only calling convention.

    I suspect context switches were expected to be rare. If one is
    chasing ILP to the maximum extent, Thread-Level Parallelism may
    be ignored. Even at a low bandwidth 32-bytes per cycle, a
    context swap would take (I think) less than 150 cycles; with 1
    GHz frequency and 1 ms OS time slice this would be 0.015% of a
    time slice used by context switch overhead. If the extra state
    allowed the program to run even a tiny bit faster, that overhead
    would not be significant.
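    The overhead estimate above can be sanity-checked with a small sketch;
    the state-size breakdown below is my own approximation of Itanium's
    architectural state (128 GPRs, 128 82-bit FPRs stored as 16 bytes,
    predicates, branch registers, some app registers), not a figure from
    the post:

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* Approximate Itanium architectural state, for illustration only. */
        int state_bytes = 128 * 8     /* 128 64-bit GPRs */
                        + 128 * 16    /* 128 82-bit FPRs, 16 bytes each */
                        + 8           /* 64 predicate bits */
                        + 8 * 8       /* 8 branch registers */
                        + 512;        /* assorted application registers */

        int bytes_per_cycle = 32;     /* the low-bandwidth assumption above */
        int swap_cycles = state_bytes / bytes_per_cycle;

        double cycles_per_slice = 1e9 * 1e-3;  /* 1 GHz, 1 ms time slice */
        double overhead_pct = 100.0 * swap_cycles / cycles_per_slice;

        printf("state: %d bytes, swap: %d cycles, overhead: %.4f%%\n",
               state_bytes, swap_cycles, overhead_pct);

        assert(swap_cycles < 150);           /* matches "less than 150" */
        assert(overhead_pct <= 0.015 + 1e-9); /* matches the 0.015% figure */
        return 0;
    }
    ```

    With these assumptions the swap comes out near 115 cycles, comfortably
    inside the "less than 150 cycles" and 0.015%-of-slice bounds quoted above.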

    System calls that cannot be run entirely in the banked registers
    might be more frequent than time slice thread switches, and
    blocking system calls would result in more thread switches. Yet
    it is not obvious to me that the context switch overhead was
    necessarily a major roadblock to system performance.

    There are other reasons (besides code density) to keep the most
    active set of data small, e.g., storage area and access power.

    Speculative execution addresses that problem quite effectively. We don't
    have a better way, almost thirty years after the Itanium design decisions
    were taken. They didn't want to do speculative execution, and they chose
    an instruction format and register set that made adding it later hard. If
    it was ever tried, nothing was released that had it, AFAIK.

    The other problem was that they had three (or six, or twelve) in-order
    pipelines running in parallel. That meant the compilers had to provide
    enough ILP to keep those pipelines fed, or they'd just eat cache capacity
    and memory bandwidth executing no-ops ... in a very bulky instruction
    set. They didn't have a general way to extract enough ILP. Nobody does,
    even now. They just assumed that with an army of developers they'd find
    enough heuristics to make it work well enough. They didn't.

    There was also an architectural misfeature with floating-point advance
    loads that could make them disappear entirely if there was a call
    instruction between an advance-load instruction and the corresponding
    check-load instruction.

    Was this really architectural in terms of initial design intent
    or a "won't fix" bug that became a de facto architectural
    feature? Based on the special case nature and your calling it a
    bug, I am guessing the latter (which I would tend to view as
    worse).

    By the way, the Mill has, in my opinion, a better design for
    hoisted loads. The loads are not speculative (though they can
    return a not-a-thing result) and the state is saved on function
    calls. The variable latency loads do require a second load
    commit operation, but a "fixed" latency load could be, e.g., two
    cycles after a function call return (which is a highly variable
    actual latency). I think a better design is possible when
    targeting out-of-order execution. A hoisted load can be
    automatically reissued so the hardware does not have to track
    all of the "active" loads.

    If I recall correctly, Itanium required holding the destination
    register unused for the duration of the load operation (so very
    long hoisting of many loads would be expensive). Itanium also
    did not distinguish between thread-local load speculation (which
    could ignore cache snooping traffic) and speculation across
    threads. (If Itanium had been designed for TLP, it might have
    included transactional memory.)

    That cost me a couple of weeks working out and
    reporting the bug, which was unfixable. The only work-around was to
    re-issue all outstanding floating-point advance-load instructions
    after each call returned. The effective code density went down further,
    and there were lots of extra read instructions issued.

    I guess it is more of an open question of what would have happened,
    say, if Intel had gone for an ISA design more like ARM64 or RISC-V
    or something.

    ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998.

    ARM64 is a little conservative/traditional (e.g., no variable-
    length encoding), but it was also a product of experience and
    some familiarity with compilers and hardware design. Since it is
    intended for in-order cores as well, it is not explicitly
    designed for speculative execution or even superscalar
    execution.

    RISC-V has
    not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.

    I do not think there is anything technically preventing RISC-V
    from having a high performance implementation; it is in some
    ways substantially more sane than x86. However, high performance
    processor design is expensive.

    Look at how slow ARM is in gaining server market share even for
    cloud computing dominated by open source software. If it was not
    for Apple, one might well doubt that a fast ARM implementation
    was possible.

    I get the impression that ARM, the company, emphasizes design
    flexibility over design efficiency. I doubt a pipeline can be
    optimized to support both 32 KiB and 64 KiB L1 caches equally well.

    I do not know how much efficiency AMD sacrifices with its shrunk
    low-frequency cores; perhaps the tools are good enough that
    almost all of the optimization opportunities are automated. I
    *guess* it is more the case that the savings from reduced
    frequency targets are so large that even a 10% loss of
    efficiency would not be problematic.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Tue Feb 24 10:41:58 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <10ngao4$d5o$[email protected]>, [email protected] (John Levine)
    wrote:

    I gather speculative execution of both branch paths worked OK if the
    branch tree wasn't too bushy. There were certainly ugly details,
    e.g., if there's a trap on a path that turns out not to be taken.

    Found a good CPU bug like that on an old AMD chip, the K6-II.

    It happened with a floating point divide by zero in the x87 registers, guarded by a test for division by zero, with floating-point traps enabled. The divide got speculatively executed, the trap was stored, the test
    revealed the divide would be by zero, the CPU tried to clean up, hit its
    bug, and just stopped. Power switch time.

    This only happened with the reverse divide instruction, which took the operands off the x87 stack in the opposite order from the usual FDIV. It
    was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.

    Still an impressive find!

    I can see that since it would be (almost?) 100% reproducible, you could
    bisect the executable (in a debugger) to home in on where it froze?

    Trying to single-step up to the crash would negate the required
    speculation, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Feb 24 09:50:43 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:


    The recent comparisons of branchy vs branchless binary search that we
    carried out on the RWT forum seem to suggest that on modern CPUs the
    branchless variant is faster even when the table does not fit in LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of a
    binary array search and prefetches accordingly. The cache misses from
    page table accesses might confuse the prefetcher, leading to worse
    performance eventually.


    Coffee Lake certainly has no such prefetcher and nevertheless exhibits
    similar behavior.

    I have now looked more closely, and the npoints parameter has a
    significant influence. If it is small enough, branchless fits into
    caches while branchy does not (see below), which might be one
    explanation for the results. Given that I have not seen npoints
    specified in any of the postings that mention branchless winning for
    veclen sizes exceeding the L3 cache size, it could be this effect. Or
    maybe some other effect. Without proper parameters one cannot
    reproduce the experiment to investigate it.

    I tried to run your code
    <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could not
    get gcc-12 to produce branchless code from slightly adapted source
    code, either. My own attempt of using extended asm did not pass your
    sanity checks, so eventually I used the assembly code produced by
    clang-19 through godbolt.
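    For readers following along, here is a minimal sketch of the two
    variants under discussion (my own reconstruction, not Anton's actual
    uut2.c): the branchy version takes a data-dependent conditional jump
    per step, while the branchless one computes the next base pointer
    arithmetically, which compilers may or may not turn into cmov, as the
    thread shows for gcc-12 and clang-14:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Branchy binary search: each step is a conditional jump, so the
       branch predictor effectively guesses the comparison outcomes. */
    static size_t search_branchy(const long *a, size_t n, long key) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] <= key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;                       /* count of elements <= key */
    }

    /* Branchless variant: the comparison selects the next base pointer;
       a compiler can emit cmov here instead of a jump (not guaranteed). */
    static size_t search_branchless(const long *a, size_t n, long key) {
        const long *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half - 1] <= key) ? base + half : base;
            n -= half;
        }
        return (size_t)(base - a) + (n && base[0] <= key);
    }

    int main(void) {
        long a[] = {1, 3, 5, 7, 9, 11, 13};
        size_t n = sizeof a / sizeof a[0];
        for (long key = 0; key <= 14; key++)
            assert(search_branchy(a, n, key) == search_branchless(a, n, key));
        assert(search_branchy(a, n, 6) == 3);   /* 1, 3, 5 are <= 6 */
        return 0;
    }
    ```

    Whether the ternary actually compiles to cmov has to be checked in the
    generated assembly, which is exactly the trouble described above.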

    I suppose that you have good reason for avoiding installation of clang17
    or later on one of your computers.

    Sure. It costs time.

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


    Run for every size from 100K to 2G in increments of x sqrt(2).

    There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
    shown below. You did not give any indication on npoints, so I
    investigated myself, and found that branchless will miss the L3 cache
    with npoints=100000, so I used that and used reps=200.

    BTW, I prefer odd number of iterations.

    You mean odd reps? Why?

    Anyway, here are the usec/point numbers:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030945 0.063620 0.038220 0.080542
    140000 0.031035 0.068244 0.034315 0.084896
    200000 0.038302 0.073819 0.045602 0.089972
    280000 0.037056 0.079651 0.042108 0.096271
    400000 0.046685 0.081895 0.055457 0.104561
    560000 0.043028 0.088356 0.055095 0.113646
    800000 0.051180 0.092201 0.074570 0.123403
    1120000 0.048806 0.096621 0.088121 0.142758
    1600000 0.060206 0.101069 0.131099 0.172171
    2240000 0.073547 0.115428 0.167353 0.205602
    3200000 0.094561 0.139996 0.208903 0.234939
    4500000 0.121049 0.162757 0.244457 0.268286
    6400000 0.152417 0.178611 0.292024 0.295204
    9000000 0.189134 0.192100 0.320426 0.327127
    12800000 0.219408 0.208083 0.372084 0.353530
    18000000 0.237684 0.222140 0.418645 0.389785
    25000000 0.270798 0.236786 0.462689 0.415937
    35000000 0.296994 0.254001 0.526235 0.451467
    50000000 0.330582 0.268768 0.599331 0.478667
    70000000 0.356788 0.288526 0.622659 0.522092
    100000000 0.388326 0.305980 0.698470 0.562841
    140000000 0.407774 0.321496 0.737884 0.609814
    200000000 0.442434 0.336242 0.848403 0.654830
    280000000 0.455125 0.356382 0.902886 0.729970
    400000000 0.496894 0.372735 1.120986 0.777920
    560000000 0.520664 0.393827 1.173606 0.855461
    800000000 0.544343 0.412087 1.759271 0.901011
    1100000000 0.584389 0.431854 1.862866 0.965724
    1600000000 0.614764 0.455844 2.046371 1.027111
    2000000000 0.622513 0.467445 2.149251 1.089775

    So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.

    Concerning the influence of npoints, I have worked with veclen=20M in
    the following.

    1) If <npoints> is small, for branchy the branch predictor will
    learn the pattern on the first repetition, and predict correctly in
    the following repetitions; on Zen4 and Tiger Lake I see the
    following percentages of branch mispredictions (for
    <npoints>*<rep>=20_000_000, <veclen>=20_000_000):

    npoints Zen4 Tiger Lake
    250 0.03% 0.51%
    500 0.03% 10.40%
    1000 0.03% 13.51%
    2000 0.07% 13.84%
    4000 13.45% 12.61%
    8000 15.26% 12.31%
    16000 15.56% 12.26%
    32000 15.60% 12.23%
    80000 15.59% 12.24%

    Tiger Lake counts slightly more branches than Zen4 (for the same
    binary): 1746M vs. 1703M, but there is also a genuinely lower number
    of mispredictions on Tiger Lake for high npoints; at npoints=80000:
    214M on Tiger Lake vs. 266M on Zen4. My guess is that on Zen4 the
    mispredicts of the deciding branch interfere with the prediction of
    the loop branches, and that the anti-interference measures on Tiger
    Lake result in the branch predictor being less effective at
    npoints=500..2000.

    2) If <npoints> is small all the actually accessed array elements
    will fit into some cache. Also, even if npoints and veclen are
    large enough that they do not all fit, with a smaller <npoints> a
    larger part of the accesses will happen to a cache, and a larger
    part to a lower-level cache. With the same parameters as above,
    branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
    following numbers of ls_any_fills_from_sys.all_dram_io
    (LLC-load-misses and l3_lookup_state.l3_miss result in <not
    supported> on this machine), and on a Core i5-1135G (Tiger Lake with
    8MB L3) the following number of LLC-load-misses:

    branchy branchless
    8700G 1135G7 8700G 1135G7
    npoints fills LLC-load-misses fills LLC-load-misses
    250 1_156_206 227_274 1_133_672 39_212
    500 1_189_372 19_820_264 1_125_836 47_994
    1000 1_170_829 125_727_181 1_130_941 96_516
    2000 1_310_015 279_063_572 1_173_501 299_297
    4000 73_528_665 452_147_169 1_151_042 5_661_917
    8000 195_883_759 501_433_404 1_248_638 58_877_208
    16000 301_559_180 511_420_040 2_688_530 101_811_222
    32000 389_512_147 511_713_759 29_799_206 116_312_019
    80000 402_131_449 512_460_513 91_276_752 118_762_341

    The 16MB L3 cache of the 8700G has 262_144 cache lines, the 8MB L3
    of the 1135G7 has 131_072 cache lines. How come branchy has so many
    cache misses already at relatively low npoints? Each search
    performs ~25 architectural memory accesses; in addition, in branchy
    we see a number of misspeculated memory accesses for each
    architectural access, resulting in the additional memory accesses.

    For branchless the 8700G sees the ramp-up of memory accesses later
    than the 1135G7 by more than the factor of 2 that the difference in
    cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
    slices, and accesses are assigned by physical address (i.e., the L3
    cache does not function like an 8MB cache with a higher
    associativity), but given the random-access nature of this workload,
    I would expect the accesses to be distributed evenly across the
    slices, with little bad effects from this cache organization.

    Given the memory access numbers of branchy and branchless above, I
    expected to see a speedup of branchy in those cases where branchless
    has a lot of L3 misses, so I decided to use npoints=100k and rep=200
    in the experiments further up. And my expectations turned out to be
    right, at least on these two machines.

    - anton


    The point at which the majority of look-ups miss L3$ at least once
    will hopefully be seen as a change of slope on the log(N) vs duration
    graph for the branchless variant.
    I would not bother with performance counters. At least for me they
    bring more confusion than insight.

    According to my understanding, on Zen4 the ratio of main DRAM
    latency to L3 latency is much higher than on either Coffee Lake or M3,
    both of which have unified LLC instead of split L3$.
    So, if on Zen4 branchy starts to win at ~2x L3 size, I will not be
    shocked. But I will be somewhat surprised.

    I actually have access to a Zen3-based EPYC, where the above mentioned
    ratio is supposedly much bigger than on any competent client CPU
    (Intel's Lunar/Arrow Lake do not belong to this category), but this
    server is currently powered down and it's a bit of a hassle to turn
    it on.


    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Feb 24 10:46:32 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    [email protected] (Anton Ertl) wrote:

    MitchAlsup <[email protected]d> writes:
    Array and Matrix scientific codes with datasets bigger than
    cache.

    The dense cases are covered by stride-based hardware predictors, so
    they are not "otherwise". I am not familiar enough with sparse
    scientific codes to comment on whether they are 1), 2), or
    "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why they
    have the roofline performance model (e.g.,
    <https://docs.nersc.gov/tools/performance/roofline/>).
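    The roofline model referenced above bounds attainable performance by
    min(peak FLOPS, bandwidth x arithmetic intensity); a sketch with
    made-up machine numbers (the 1 TFLOP/s and 100 GB/s figures are
    illustrative, not any real CPU):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Roofline bound: attainable GFLOP/s given peak compute, memory
       bandwidth, and the kernel's arithmetic intensity (FLOPs/byte). */
    static double roofline(double peak_gflops, double bw_gbs, double intensity) {
        double mem_bound = bw_gbs * intensity;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void) {
        double peak = 1000.0;   /* hypothetical 1 TFLOP/s peak */
        double bw = 100.0;      /* hypothetical 100 GB/s DRAM bandwidth */

        /* Stream-like triad: 2 FLOPs per 24 bytes -> bandwidth-bound. */
        printf("triad:  %.1f GFLOP/s\n", roofline(peak, bw, 2.0 / 24.0));

        /* Blocked dense matmul: high data reuse -> compute-bound. */
        printf("matmul: %.1f GFLOP/s\n", roofline(peak, bw, 50.0));

        assert(roofline(peak, bw, 2.0 / 24.0) < 10.0);
        assert(roofline(peak, bw, 50.0) == peak);
        return 0;
    }
    ```

    This is why low-intensity HPC kernels sit on the bandwidth slope of
    the roofline no matter how wide the core is.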

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
    decomposition of matrices of similar dimensions.

    "Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
    access.

    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's no
    longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from compute-bound
    to memory-bound.


    It does not slow down the execution. To the contrary, it speeds it up
    so much that the speed of handling L2 misses begins to matter.
    Note that this is just my speculation, which can be wrong.
    IIRC, right now binary64-capable AMX is available only on Apple Silicon
    (via SME) and maybe on IBM Z. I didn't play with either.

    AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
    "matrix" in its name, but is not AMX. Looking at <https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
    to me that the point of AMX is to deal with small matrices (16x64
    times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
    elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
    unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
    in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
    in these 16 cycles. If each of these matrix multiplications happens
    on its own, the result will certainly be bandwidth-bound to L2
    and maybe already to L1. If, OTOH, these operations are part of a
    larger matrix multiplication, then cache blocking can probably lower
    the bandwidth to L2 enough, and reusing one of the operands in
    registers can lower the bandwidth to L1 enough.
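    The bandwidth arithmetic in that paragraph can be checked directly,
    using the Int8 tile dimensions from the Wikipedia description cited
    above:

    ```c
    #include <assert.h>

    int main(void) {
        /* One Int8 tile multiply: 16x64 times 64x16 operand tiles. */
        int a_bytes = 16 * 64;          /* 1024 bytes of A */
        int b_bytes = 64 * 16;          /* 1024 bytes of B */
        int cycles = 16;

        /* 2048 operand bytes over 16 cycles = 128 bytes per cycle. */
        assert((a_bytes + b_bytes) / cycles == 128);
        return 0;
    }
    ```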

    In any case, Intel will certainly not add hardware that exceeds the
    bandwidth boundaries in all common use cases.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Feb 24 11:25:08 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and
    bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non-scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say 300 byte) record for each of its many thousands of
    employees, each containing typical employee stuff. Now suppose someone
    wants to know "What is the total salary of all the employees in the
    Sales department?" With no index on "department", but with it at a
    fixed displacement within each record, the code looks at each record,
    does a trivial test on it, perhaps adds to a register, then goes to
    the next record. This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a latency-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.
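    A minimal sketch of the two layouts being contrasted (struct and field
    names are hypothetical, not from any real database):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical fixed-size ~300-byte employee record. */
    struct employee {
        char dept[32];
        long salary;
        char other[260];
    };

    /* Sequential array of records: the access stride is constant, so a
       stride-based prefetcher can run ahead of the scan. */
    static long total_salary_array(const struct employee *e, size_t n,
                                   const char *dept) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            if (strcmp(e[i].dept, dept) == 0)
                total += e[i].salary;
        return total;
    }

    /* Linked list of randomly placed records: each step needs the previous
       load's result to find the next node -- latency-bound pointer
       chasing that a stride prefetcher cannot help with. */
    struct node { struct employee e; struct node *next; };

    static long total_salary_list(const struct node *p, const char *dept) {
        long total = 0;
        for (; p != NULL; p = p->next)
            if (strcmp(p->e.dept, dept) == 0)
                total += p->e.salary;
        return total;
    }

    int main(void) {
        struct employee e[3];
        memset(e, 0, sizeof e);
        strcpy(e[0].dept, "Sales"); e[0].salary = 50000;
        strcpy(e[1].dept, "R&D");   e[1].salary = 70000;
        strcpy(e[2].dept, "Sales"); e[2].salary = 60000;
        assert(total_salary_array(e, 3, "Sales") == 110000);

        struct node n2 = { e[2], NULL }, n1 = { e[1], &n2 },
                    n0 = { e[0], &n1 };
        assert(total_salary_list(&n0, "Sales") == 110000);
        return 0;
    }
    ```

    Both loops do the same work; only the memory access pattern, and thus
    what the prefetcher can do for them, differs.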

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Tue Feb 24 12:30:56 2026
    From Newsgroup: comp.arch

    Anton Ertl <[email protected]> schrieb:
    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    CFD codes are usually memory bandwidth limited; they use sparse
    matrices.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Tue Feb 24 17:23:27 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 09:50:43 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:


    The recent comparisons of branchy vs branchless binary search
    that we carried on RWT forum seems to suggest that on modern CPUs
    branchless variant is faster even when the table does not fit in
    LLC.

    Only two explanations come to my mind:

    1) The M3 has a hardware prefetcher that recognizes the pattern of
    a binary array search and prefetches accordingly. The cache
    misses from page table accesses might confuse the prefetcher,
    leading to worse performance eventually.


    Coffee Lake certainly has no such prefetcher and nevertheless
    exhibits similar behavior.

    I have now looked more closely, and the npoints parameter has a
    significant influence. If it is small enough, branchless fits into
    caches while branchy does not (see below), which might be one
    explanation for the results. Given that I have not seen npoints
    specified in any of the postings that mention branchless winning for
    veclen sizes exceeding the L3 cache size, it could be this effect.

    I think that it was said more than once throughout the thread that all measurements were taken with npoints=1M and rep=11.

    Or
    maybe some other effect. Without proper parameters one cannot
    reproduce the experiment to investigate it.

    I tried to run your code
    <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
    on Zen4, but clang-14 converts uut2.c into branchy code. I could
    not get gcc-12 to produce branchless code from slightly adapted
    source code, either. My own attempt of using extended asm did not
    pass your sanity checks, so eventually I used the assembly code
    produced by clang-19 through godbolt.

    I suppose that you have good reason for avoiding installation of
    clang17 or later on one of your computers.

    Sure. It costs time.

    If you have recommendations what to use for the other parameters, I
    can run other sizes as well.

    - anton


    Run for every size from 100K to 2G in increments of x sqrt(2).

    There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
    shown below. You did not give any indication on npoints, so I
    investigated myself, and found that branchless will miss the L3 cache
    with npoints=100000, so I used that and used reps=200.

    BTW, I prefer odd number of iterations.

    You mean odd reps? Why?

    Because I like quick tests. Since I always use large npoints, for the
    test to finish quickly I have to use a small rep. And for a small rep
    the even-size median has non-negligible bias.



    Anyway, here are the usec/point numbers:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030945 0.063620 0.038220 0.080542
    140000 0.031035 0.068244 0.034315 0.084896
    200000 0.038302 0.073819 0.045602 0.089972
    280000 0.037056 0.079651 0.042108 0.096271
    400000 0.046685 0.081895 0.055457 0.104561
    560000 0.043028 0.088356 0.055095 0.113646
    800000 0.051180 0.092201 0.074570 0.123403
    1120000 0.048806 0.096621 0.088121 0.142758
    1600000 0.060206 0.101069 0.131099 0.172171
    2240000 0.073547 0.115428 0.167353 0.205602
    3200000 0.094561 0.139996 0.208903 0.234939
    4500000 0.121049 0.162757 0.244457 0.268286
    6400000 0.152417 0.178611 0.292024 0.295204
    9000000 0.189134 0.192100 0.320426 0.327127
    12800000 0.219408 0.208083 0.372084 0.353530
    18000000 0.237684 0.222140 0.418645 0.389785
    25000000 0.270798 0.236786 0.462689 0.415937
    35000000 0.296994 0.254001 0.526235 0.451467
    50000000 0.330582 0.268768 0.599331 0.478667
    70000000 0.356788 0.288526 0.622659 0.522092
    100000000 0.388326 0.305980 0.698470 0.562841
    140000000 0.407774 0.321496 0.737884 0.609814
    200000000 0.442434 0.336242 0.848403 0.654830
    280000000 0.455125 0.356382 0.902886 0.729970
    400000000 0.496894 0.372735 1.120986 0.777920
    560000000 0.520664 0.393827 1.173606 0.855461
    800000000 0.544343 0.412087 1.759271 0.901011
    1100000000 0.584389 0.431854 1.862866 0.965724
    1600000000 0.614764 0.455844 2.046371 1.027111
    2000000000 0.622513 0.467445 2.149251 1.089775

    So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.

    Concerning the influence of npoints, I have worked with veclen=20M in
    the following.

    1) If <npoints> is small, for branchy the branch predictor will
    learn the pattern on the first repetition, and predict correctly in
    the following repetitions; on Zen4 and Tiger Lake I see the
    following percentages of branch mispredictions (for
    <npoints>*<rep>=20_000_000, <veclen>=20_000_000):

    npoints Zen4 Tiger Lake
    250 0.03% 0.51%
    500 0.03% 10.40%
    1000 0.03% 13.51%
    2000 0.07% 13.84%
    4000 13.45% 12.61%
    8000 15.26% 12.31%
    16000 15.56% 12.26%
    32000 15.60% 12.23%
    80000 15.59% 12.24%

    Tiger Lake counts slightly more branches than Zen4 (for the same
    binary): 1746M vs. 1703M, but there is also a real lower number of
    mispredictions on Tiger Lake for high npoints; at npoints=80000:
    214M on Tiger Lake vs. 266M on Zen4. My guess is that in Zen4 the
    mispredicts of the deciding branch interfere with the prediction of
    the loop branches, and that the anti-interference measures on Tiger
    Lake results in the branch predictor being less effective at
    npoints=500..2000.

    2) If <npoints> is small all the actually accessed array elements
    will fit into some cache. Also, even if npoints and veclen are
    large enough that they do not all fit, with a smaller <npoints> a
    larger part of the accesses will happen to a cache, and a larger
    part to a lower-level cache. With the same parameters as above,
    branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
    following numbers of ls_any_fills_from_sys.all_dram_io
    (LLC-load-misses and l3_lookup_state.l3_miss result in <not
    supported> on this machine), and on a Core i5-1135G (Tiger Lake with
    8MB L3) the following number of LLC-load-misses:

    branchy branchless
    8700G 1135G7 8700G 1135G7
    npoints fills LLC-load-misses fills LLC-load-misses
    250 1_156_206 227_274 1_133_672 39_212
    500 1_189_372 19_820_264 1_125_836 47_994
    1000 1_170_829 125_727_181 1_130_941 96_516
    2000 1_310_015 279_063_572 1_173_501 299_297
    4000 73_528_665 452_147_169 1_151_042 5_661_917
    8000 195_883_759 501_433_404 1_248_638 58_877_208
    16000 301_559_180 511_420_040 2_688_530 101_811_222
    32000 389_512_147 511_713_759 29_799_206 116_312_019
    80000 402_131_449 512_460_513 91_276_752 118_762_341

    The 16MB L3 cache of the 8700G has 262_144 cache lines, the 8MB L3
    of the 1135G7 has 131_072 cache lines. How come branchy has so many
    cache misses already at relatively low npoints? Each search
    performs ~25 architectural memory accesses; in addition, in branchy
    we see a number of misspeculated memory accesses for each
    architectural access, resulting in the additional memory accesses.

    For branchless the 8700G sees a ramp-up of memory accesses more than
    the factor of 2 later compared to the 1135G7 that the differences in
    cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
    slices, and accesses are assigned by physical address (i.e., the L3
    cache does not function like an 8MB cache with a higher
    associativity), but given the random-access nature of this workload,
    I would expect the accesses to be distributed evenly across the
    slices, with little adverse effect from this cache organization.

    Given the memory access numbers of branchy and branchless above, I
    expected to see a speedup of branchy in those cases where branchless
    has a lot of L3 misses, so I decided to use npoints=100k and rep=200
    in the experiments further up. And my expectations turned out to be
    right, at least on these two machines.

    - anton


    Intuitively, I don't like measurements with such a small npoints parameter,
    but objectively they are unlikely to be much different from 1M points.
    Looking at them, your results do not differ radically from those we got
    on the RWT forum.

    CPU LLC(MB) Cross point(M elems) sz/LLC
    TGL 8 12.8 12.8
    CFL 12 30.0 15.0
    Zen4 16 12.8 6.4
    M3 24 200.0 66.7
    RPT 33 1300.0 315.2

    The last line is Intel Raptor Lake (i7-14700). On this CPU I had to use
    1/3rd of installed memory in order to find a cross point.
    Here at vecsize=41M, which is ten times bigger than LLC, branchless is
    still almost 1.5x faster than branchy (0.216 vs 0.315 usec/point).

    In all cases the ratio of vector size at the cross point to LLC size
    is much bigger than 1.
    On Tiger Lake the ratio is smaller than on the others, probably
    because of Intel's slow "mobile" memory controller, but even here it
    is significantly above 1.
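
    The sz/LLC column above is just (crossover veclen x 8 bytes per
    element) / LLC size; assuming 8-byte elements, the calculation is:

```c
/* Ratio of the data-set size at the branchy/branchless crossover to the
   LLC size, assuming 8-byte (double) elements. */
double cross_ratio(double veclen_millions, double llc_mbytes)
{
    return veclen_millions * 8.0 / llc_mbytes;
}
/* cross_ratio(12.8, 16.0) = 6.4 for Zen4;
   cross_ratio(1300.0, 33.0) is about 315 for Raptor Lake */
```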

    So far I don't see how your measurements disprove my theory of page
    table not fitting in L2 cache.







    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Tue Feb 24 07:51:13 2026
    From Newsgroup: comp.arch

    On 2/24/2026 3:25 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say 300 byte) record for each of its many thousands of employees
    (each containing typical employee stuff). Now suppose someone wants to
    know "What is the total salary of all the employees in the Sales
    department?" With no index on "department", but with that field at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record. This is almost certainly memory latency bound.
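
    The scan described above might look like this in C (a minimal sketch;
    the 300-byte layout, field names, and offsets are invented for
    illustration):

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical fixed-layout ~300-byte record; only the department field
   (at a fixed displacement) and the salary are touched by the scan. */
struct employee {
    char   other1[40];   /* name, id, ... */
    char   dept[16];     /* fixed-offset department field */
    double salary;
    char   other2[236];  /* pads the record to ~300 bytes */
};

/* Sequential scan: trivial test per record, optional add, next record. */
double total_salary(const struct employee *tab, size_t n, const char *dept)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].dept, dept) == 0)
            sum += tab[i].salary;
    return sum;
}
```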

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.


    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.
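
    For the array-of-pointers case, the software prefetch could look like
    this (a sketch using the GCC/Clang __builtin_prefetch builtin; the
    record layout and the prefetch distance PF_DIST are assumptions to be
    tuned):

```c
#include <stddef.h>

struct rec { double salary; /* ... other fields ... */ };

/* Prefetch PF_DIST records ahead through the pointer array: the pointer
   array itself is sequential (covered by hardware prefetching), while the
   randomly placed records need explicit prefetching. */
#define PF_DIST 16

double total_salary_indirect(struct rec *const *ptrs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(ptrs[i + PF_DIST], 0, 0); /* read, no reuse */
        sum += ptrs[i]->salary;
    }
    return sum;
}
```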

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.

    Yes, I miscalculated. My intent was to force a DRAM access for each
    record, which would make the problem worse (DRAM access time versus L3
    access time). But I think the same issue would apply even if it fits
    in an L3 cache; and if it doesn't, increase the record size or number of records so that it doesn't fit in L3. But this just changes the number
    of prefetches in process needed to prevent it from becoming latency bound.

    Thanks.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Tue Feb 24 18:26:22 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 10:46:32 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 22 Feb 2026 13:17:30 GMT
    [email protected] (Anton Ertl) wrote:

    MitchAlsup <[email protected]d> writes:
    Array and Matrix scientific codes with datasets bigger than
    cache.

    The dense cases are covered by stride-based hardware
    predictors, so they are not "otherwise". I am not familiar
    enough with sparse scientific codes to comment on whether they
    are 1), 2), or "otherwise".

    BLAS Level 3 is not particularly external/LLC bandwidth intensive
    even without hardware predictors.

    There are HPC applications that are bandwidth limited; that's why
    they have the roofline performance model (e.g.,
    <https://docs.nersc.gov/tools/performance/roofline/>).

    Sure. But that's not what I would call "dense".
    In my vocabulary "dense" starts at matmul(200x200, 200x200) or at LU decomposition of matrices of similar dimensions.

    "Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
    access.

    So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
    beyond that coming from the bandwidth limits (if any).

    Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
    but is memory bandwidth limited.

    Overwhelming majority of data served from
    L2 cache.
    That's with classic SIMD. It's possible that with AMX units it's
    no longer true.

    I very much doubt that. There is little point in adding an
    instruction that slows down execution by turning it from
    compute-bound to memory-bound.


    It does not slow down the execution. To the contrary, it speeds it up
    so much that speed of handling of L2 misses begins to matter.
    Pay attention that it's just my speculation that can be wrong.
    IIRC, right now binary64-capable AMX is available only on Apple
    Silicon (via SME) and may be on IBM Z. I didn't play with either.

    AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
    "matrix" in its name, but is not AMX.

    Apple's extension was called AMX back when it was accessible only
    through close-source library calls and had no instruction-level
    documentation. Later on Apple pushed it through Arm's standardization
    process and since then everybody calls it SME. But it is the same thing.

    Looking at
    <https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
    to me that the point of AMX is to deal with small matrices (16x64
    times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
    elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
    unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
    in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
    in these 16 cycles. If each of these matrix multiplications happens
    on its own, the result will certainly be bandwidth-bound to L2
    and maybe already to L1. If, OTOH, these operations are part of a
    larger matrix multiplication, then cache blocking can probably lower
    the bandwidth to L2 enough, and reusing one of the operands in
    registers can lower the bandwidth to L1 enough.


    The question is not whether bandwidth from L2 to AMX is sufficient. It
    is.
    The interesting part is what bandwidth from LLC *into* L2 is needed in
    scenario of multiplication of big matrices.
    Supposedly 1/3rd of L2 is used for matrix A that stays here for very
    long time, another 1/3rd holds matrix C that remains here for, say, 8 iterations and remaining 1/3rd is matrix B that should be reloaded on
    every iteration. AMX increases the frequency at which we have to reload
    B and C.
    Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
    rounded down to 288. Each step takes 288**3 = 24M FMA operations. If a
    future AMX does 128 FMAs per clock then the iteration takes 188K clocks.
    288**2 * 8B / 188K = 3.5 B/clock. That's easy for 1 AMX/L2 combo, but
    very hard when you have 64 units like those competing on the same
    LLC.
    Yes, suggestion of 128 FMA/clock was exaggerated, but 64 is realistic
    and 32 is an absolute minimum. Building double-precision capable AMX is
    not worth the trouble if it can not do at least 32 FMAs.
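
    The estimate above reduces to a one-liner: reloading an N x N FP64
    tile every N**3/f clocks (at f FMAs per clock) demands
    N*N*8 / (N**3/f) = 8*f/N bytes per clock from the LLC. A sketch of the
    arithmetic (numbers taken from the post, not measured):

```c
/* Bytes per clock needed from LLC to reload one N x N FP64 tile while an
   AMX unit retires f FMAs per clock: (N*N*8) / (N*N*N/f) = 8*f/N. */
double llc_bytes_per_clock(double n, double fma_per_clock)
{
    double clocks_per_step = n * n * n / fma_per_clock;
    return n * n * 8.0 / clocks_per_step;
}
/* llc_bytes_per_clock(288, 128) = 8*128/288, about 3.56 B/clock,
   matching the ~3.5 B/clock figure above */
```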

    In any case, Intel will certainly not add hardware that exceeds the
    bandwidth boundaries in all common use cases.

    - anton

    I wouldn't be so sure.
    In relatively recent history Intel released Knights Landing, which was unbalanced to an extreme. Neither instruction decoders nor register ports
    nor caches were sufficient to feed its vector units at any workload
    that even remotely resembled useful algorithm.










    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Tue Feb 24 17:30:31 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Tue, 24 Feb 2026 09:50:43 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:
    [...]
    I think that it was said more than once throughout the thread that all measurements were taken with npoints=1M and rep=11.

    I obviously missed it, that's why I asked for the parameters earlier.

    And for small rep
    even-size median has non-negligible bias.

    Even-size median is the arithmetic mean between the (n-1)/2-lowest and
    the (n+1)/2-lowest result. What bias do you have in mind?

    Here are the results with npoints=1M reps=11:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030187 0.063023 0.038614 0.081332
    140000 0.029779 0.066112 0.034797 0.085259
    200000 0.037532 0.072898 0.045945 0.090583
    280000 0.036677 0.078795 0.042678 0.097386
    400000 0.045777 0.084639 0.055256 0.104249
    560000 0.043092 0.086424 0.055677 0.112252
    800000 0.050325 0.091714 0.078172 0.123927
    1120000 0.048855 0.095756 0.091508 0.141819
    1600000 0.061101 0.099660 0.133560 0.168403
    2240000 0.082970 0.113055 0.166363 0.205129
    3200000 0.120834 0.137753 0.211253 0.235155
    4500000 0.149028 0.160359 0.241716 0.268885
    6400000 0.178675 0.177070 0.294336 0.294931
    9000000 0.219919 0.195292 0.322921 0.326113
    12800000 0.240604 0.209397 0.380076 0.352707
    18000000 0.257400 0.224952 0.409813 0.388834
    25000000 0.293160 0.239217 0.472999 0.413559
    35000000 0.307939 0.257587 0.522223 0.450754
    50000000 0.346399 0.271420 0.595265 0.478713
    70000000 0.375525 0.286462 0.626350 0.516533
    100000000 0.401462 0.305444 0.699097 0.555928
    140000000 0.419590 0.322480 0.732551 0.611815
    200000000 0.451003 0.337880 0.854034 0.647795
    280000000 0.468193 0.359133 0.914159 0.729234
    400000000 0.506300 0.372219 1.150605 0.773266
    560000000 0.520346 0.390277 1.167567 0.851592
    800000000 0.556292 0.412035 1.799293 0.899768
    1100000000 0.572654 0.432598 1.861136 0.963649
    1600000000 0.618735 0.450526 2.062255 1.026560
    2000000000 0.629088 0.465219 2.132035 1.073728

    Intuitively, I don't like measurements with such a small npoints parameter,
    but objectively they are unlikely to be much different from 1M points.

    It's pretty similar, but the first point where branchy is faster is
    veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
    1135G7 the 6.4M and 9M veclens are pretty close for both npoints.

    My current theory why the crossover is so late is twofold:

    1) branchy performs a lot of misspeculated accesses, which reduce the
    cache hit rate (and increasing the cache miss rate) already at
    relatively low npoints (as shown in <[email protected]>), and likely also for
    low veclen with high npoints. E.g., already npoints=4000 has a >2M
    fills from memory on the 8700G, and npoints=500 has >2M
    LLC-load-misses at 1135G7, whereas the corresponding numbers for
    branchless are npoints=16000 and npoints=4000. This means that
    branchy does not just suffer from the misprediction penalty, but also
    from fewer cache hits for the middle stages of the search. How strong
    this effect is depends on the memory subsystem of the CPU core.

    2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
    overtaking branchless, but it has to first compensate for the slowness
    in the earlier levels before it can reach crossover. That's why we
    see the crossover point at significantly larger sizes than L3.

    On Tiger Lake ratio is smaller than on others, probably because of
    Intel's slow "mobile" memory controller

    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.

    So far I don't see how your measurements disprove my theory of page
    table not fitting in L2 cache.

    I did not try to disprove that; if I did, I would try to use huge
    pages (the 1G ones if possible) and see how that changes the results.

    But somehow I fail to see why the page table walks should make much of
    a difference.
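
    For the huge-page experiment, allocation on Linux could be sketched
    like this (hedged: MAP_HUGETLB is Linux-specific, the huge-page pools
    must be reserved by the administrator beforehand, and this function
    silently falls back to normal 4K pages, so one should check
    /proc/self/smaps to confirm what was actually granted):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mman.h>
#include <stddef.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000              /* Linux x86 value */
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#define HUGE_2MB (21 << MAP_HUGE_SHIFT)  /* log2(2MB) in the flag field */
#define HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1GB) */

/* Try 1G pages, then 2M pages, then plain pages; with 1G pages the whole
   test vector needs only a handful of TLB entries and almost no page-table
   walks, isolating the page-table effect under discussion. */
static void *alloc_for_benchmark(size_t bytes)
{
    int prot  = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = mmap(NULL, bytes, prot, flags | MAP_HUGETLB | HUGE_1GB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    p = mmap(NULL, bytes, prot, flags | MAP_HUGETLB | HUGE_2MB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    return mmap(NULL, bytes, prot, flags, -1, 0); /* 4K fallback */
}
```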

    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Tue Feb 24 22:22:01 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 17:30:31 GMT
    [email protected] (Anton Ertl) wrote:


    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

    No problems.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Feb 24 17:32:45 2026
    From Newsgroup: comp.arch

    On 2/21/2026 4:56 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/21/2026 2:15 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/20/2026 5:49 PM, MitchAlsup wrote:
    ----------------------------

    There is a non-zero risk though when one disallows uses that are
    theoretically allowed in the ISA, even if GCC doesn't use them.

    This is why one must decode all 32-bits of each instruction--so that
    there is no hole in the decoder that would allow the core to do some-
    thing not directly specified in ISA. {And one of the things that make
    an industrial quality ISA so hard to fully specify.}}
    ---------------------

    Sometimes there is a tension:
    What is theoretically allowed in the ISA;
    What is the theoretically expected behavior in some abstract model;
    What stuff is actually used by compilers;
    What features or behaviors does one want;
    ...
    Whether your ISA can be attacked with Spectre and/or Meltdown;
    Whether your DRAM can be attacked with RowHammer;
    Whether your call/return interface can be attacked with:
    { Return Oriented Programming, Buffer Overflows, ...}

    That is; whether you care if your system provides a decently robust programming environment.

    I happen to care. Apparently, most do not.


    There is a way at least, as noted, to optionally provide some additional protection against buffer overflows (in a compiler that does not use
    stack canaries, eg, GCC).

    But, as-noted, it disallows AUIPC+JALR to use X1 in this way.
    Even if compiler output does generally use X5 for this case.


    Implementing RISC-V strictly as per an abstract model would limit both
    efficiency and hinder some use-cases.

    One can make an argument that it is GOOD to limit attack vectors, and
    provide a system that is robust in the face of attacks.


    This was a partial motivation for deviating from the abstract model.

    Deviating from the abstract model in some cases allows closing down
    attack vectors.


    Then it comes down to "what do compilers do" and "what unintentional
    behaviors could an ASM programmer stumble onto unintentionally".

    Naïve at best.


    Possibly, but there are some things a case can be made for disallowing:
    Using X1 for things other than as a Link Register;
    Disallowing JAL and JALR with Rd other than X0 or X1;
    Disallowing most instructions, other than a few special cases, from
    having X0 or X1 as a destination.

    RISC-V has a lot of "Hint" instructions, but a case can be made for
    making many of them illegal (where trying to use them results in an exception, rather than their simply being ignored).

    In some other cases, it may be justified to disallow (and generate an exception for) things which can be expressed in the ISA, technically,
    but don't actually make sense for a program to make use of (some amount
    of edge cases that result in NOPs, or sometimes non-NOP behaviors which
    don't actually make sense); but are more likely to appear in undesirable
    cases (such as the CPU executing random garbage as instructions).

    ...


    Say, for example, the normal/canonical RISC-V NOP can't be expressed
    without 0x00 (NUL) bytes, whereas many other HINT type instructions can
    be encoded without NUL bytes.

    If someone can't as easily compose a NUL-byte-free NOP-slide, it makes
    it harder to inject shell code via ASCII strings (as does hindering the ability to tamper with return addresses), avoiding casual use of RWX
    memory, etc.


    The JAL/JALR Rd=X0|X1 only case is one of those cases where one can
    argue a use-case exists, but is so rarely used as to make its native
    existence in hardware (or in an ISA design) difficult to justify. In
    effect, supporting it in HW adds non-zero cost, programs don't actually
    use it, and it burns 4 bits of encoding space you aren't really getting
    back (and they could have used it for something more useful, say, making
    JAL have a 16MB range or something).

    While one can't change the encoding now, they can essentially just turn
    the generic case into a trap and call it done.

    ...



    Though, yes, even if one nails all this down, there are still often
    other (non-memory) attack vectors (such as attacking program logic).

    Saw something not too long ago where there was an RCE exploit for some
    random system (which operated via an HTTP server and HTTP requests; I
    think for "enterprise supply-chain stuff" or something), where the
    exploit was basically the ability to execute arbitrary shell commands
    via expressing them as an HTTP request (with said server effectively
    running as "setuid root" or similar).

    Or, basically, something so insecure that someone could (in theory) hack
    it by typing specific URLs into a web browser or something (or maybe
    using "wget" via a bash script).

    Like, something like:
    http : //someaddress/cgi-bin/system.cgi?cmd=sshd%20...
    (Say, spawn an SSH server so they can stop using HTTP requests).
    Leaving people to just ROFLOL about how bad it was...

    And, how did the original product operate?... Mostly by sending
    unsecured shell commands over HTTP.


    So, alas...


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Feb 25 07:33:06 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I have not looked closely at hardware prefetchers, but I am sure that
    their designers have taken into account how far they have to prefetch
    such that the loads they prefetch for usually see a short latency.

    But prefetchers may not play a role anyway: A large number of (more
    important) demand loads will probably suppress the prefetcher. But
    that's no problem: modern uarchs tend to support enough outstanding
    loads to utilize the full bandwidth available to the core. E.g., Zen4
    has 64GB/s bandwidth from one core (actually one CCX) to the memory controllers, i.e., it can deliver 1 cache line/ns to the core from
    RAM. It also has 88 entries in the load-execution queue, which are
    good to cover 88*1ns=88ns of latency, which is roughly the latency one
    sees from DDR5 DIMMs; with the higher-latency LPDDR5 a single core may
    not be able to make use of all the available bandwidth.
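
    The 88-entry figure can be checked with Little's law: sustaining
    bandwidth B at latency L needs B*L bytes in flight. A sketch with the
    numbers above (64 B/ns per CCX, 64-byte cache lines, ~88 ns DRAM
    latency):

```c
/* Little's law: cache lines that must be in flight to sustain a given
   bandwidth at a given memory latency, with 64-byte lines. */
double lines_in_flight(double bw_bytes_per_ns, double latency_ns)
{
    return bw_bytes_per_ns * latency_ns / 64.0;
}
/* lines_in_flight(64.0, 88.0) = 88: exactly the 88 load-queue entries,
   so DDR5 latency is covered but anything slower (e.g. LPDDR5) is not */
```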

    If there are enough instructions between the loads such that the ROB
    fills before the load-execution queue, demand loads may not be able to
    fully utilize the bandwidth, but that leaves slots open for the
    prefetcher, so the core will still make use of the available
    bandwidth.

    Of course, once you reach the bandwidth limit, loads that queue up
    behind other loads for attention from the memory subsystem will see
    more latency than in an uncongested memory system, but one does not
    consider that to be latency-bound.

    But that's all theoretical considerations; you could try to write a
    program with these characteristics, and see how it performs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Feb 25 08:17:53 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Tue, 24 Feb 2026 10:46:32 GMT
    [email protected] (Anton Ertl) wrote:
    The question is not whether bandwidth from L2 to AMX is sufficient. It
    is.
    The interesting part is what bandwidth from LLC *into* L2 is needed in scenario of multiplication of big matrices.
    Supposedly 1/3rd of L2 is used for matrix A that stays here for very
    long time, another 1/3rd holds matrix C that remains here for, say, 8 iterations and remaining 1/3rd is matrix B that should be reloaded on
    every iteration. AMX increases the frequency at which we have to reload
    B and C.
    Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
    rounded down to 288. Each step takes 288**3 = 24M FMA operations. If a
    future AMX does 128 FMAs per clock then the iteration takes 188K clocks.
    288**2 * 8B / 188K = 3.5 B/clock. That's easy for 1 AMX/L2 combo, but
    very hard when you have 64 units like those competing on the same
    LLC.

    The 64 units are from 64 cores? I have not looked at the kind of
    bandwidth that the 2D grid interconnect between the cores and their L3
    cache slices offers to the cores of the big Xeons, but assuming that
    the bandwidth is insufficient for that, one will probably buy a chip
    with fewer cores if all that the cores are intended to do is to
    multiply matrices.

    Given that these cores are also used for applications other than
    matrix multiplication, and for some of those it makes sense to have
    that many cores (or more) even with that bandwidth limit to L3, I
    cannot fault Intel for building such CPUs. And I also cannot fault
    them for designing cores in a way that some hardware cannot be fully
    utilized by some applications in some of the CPU configurations that
    use that core. As long as there is an application where the hardware
    can be fully utilized for some CPU configuration, and the additional
    hardware benefits this application enough, the hardware feature
    technically makes sense (for commercial sense additional
    considerations apply).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Feb 25 15:07:23 2026
    From Newsgroup: comp.arch

    On Tue, 24 Feb 2026 17:30:31 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Tue, 24 Feb 2026 09:50:43 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Mon, 23 Feb 2026 08:06:20 GMT
    [email protected] (Anton Ertl) wrote:
    [...]
    I think that it was said more than once throughout the thread that
    all measurements were taken with npoints=1M and rep=11.

    I obviously missed it, that's why I asked for the parameters earlier.

    And for small rep
    even-size median has non-negligible bias.

    Even-size median is the arithmetic mean between the (n-1)/2-lowest and
    the (n+1)/2-lowest result. What bias do you have in mind?


    You had seen my code. It has no special handling for even nrep.
    It simply takes dt[nrep/2], like in the odd case.
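
    The difference under discussion, sketched (dt holds the
    per-repetition timings): for odd nrep, dt[nrep/2] is the true median;
    for even nrep it is the upper of the two middle values, which biases
    the estimate upward relative to the even-size median Anton describes.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Even-size median as Anton defines it: the mean of the two middle
   elements; for odd n this is the plain middle element. */
double median(double *dt, size_t n)
{
    qsort(dt, n, sizeof *dt, cmp_double);
    if (n % 2)
        return dt[n / 2];
    return 0.5 * (dt[n / 2 - 1] + dt[n / 2]);  /* dt[n/2] alone biases up */
}
```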


    Here are the results with npoints=1M reps=11:

    Zen4 8700G Tiger Lake 1135G7
    veclen branchless branchy branchless branchy
    100000 0.030187 0.063023 0.038614 0.081332
    140000 0.029779 0.066112 0.034797 0.085259
    200000 0.037532 0.072898 0.045945 0.090583
    280000 0.036677 0.078795 0.042678 0.097386
    400000 0.045777 0.084639 0.055256 0.104249
    560000 0.043092 0.086424 0.055677 0.112252
    800000 0.050325 0.091714 0.078172 0.123927
    1120000 0.048855 0.095756 0.091508 0.141819
    1600000 0.061101 0.099660 0.133560 0.168403
    2240000 0.082970 0.113055 0.166363 0.205129
    3200000 0.120834 0.137753 0.211253 0.235155
    4500000 0.149028 0.160359 0.241716 0.268885
    6400000 0.178675 0.177070 0.294336 0.294931
    9000000 0.219919 0.195292 0.322921 0.326113
    12800000 0.240604 0.209397 0.380076 0.352707
    18000000 0.257400 0.224952 0.409813 0.388834
    25000000 0.293160 0.239217 0.472999 0.413559
    35000000 0.307939 0.257587 0.522223 0.450754
    50000000 0.346399 0.271420 0.595265 0.478713
    70000000 0.375525 0.286462 0.626350 0.516533
    100000000 0.401462 0.305444 0.699097 0.555928
    140000000 0.419590 0.322480 0.732551 0.611815
    200000000 0.451003 0.337880 0.854034 0.647795
    280000000 0.468193 0.359133 0.914159 0.729234
    400000000 0.506300 0.372219 1.150605 0.773266
    560000000 0.520346 0.390277 1.167567 0.851592
    800000000 0.556292 0.412035 1.799293 0.899768
    1100000000 0.572654 0.432598 1.861136 0.963649
    1600000000 0.618735 0.450526 2.062255 1.026560
    2000000000 0.629088 0.465219 2.132035 1.073728

    Intuitively, I don't like measurements with such a small npoints
    parameter, but objectively they are unlikely to be much different
    from 1M points.

    It's pretty similar, but the first point where branchy is faster is veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
    1135G7 the 6.4M and 9M veclens are pretty close for both npoints.


    Here are my Raptor Lake results (Raptor Cove and Gracemont)
    i7-14700-P i7-14700-E
    veclen branchless branchy branchless branchy
    100000 0.025580 0.065588 0.055901 0.084580
    140000 0.022006 0.067718 0.055513 0.088897
    200000 0.029240 0.071393 0.066058 0.092738
    280000 0.026032 0.074760 0.065444 0.096932
    400000 0.034791 0.082914 0.076366 0.101067
    560000 0.033920 0.089877 0.078420 0.107085
    800000 0.046743 0.098518 0.086972 0.115900
    1120000 0.045401 0.108028 0.088823 0.125935
    1600000 0.059368 0.119298 0.112489 0.139820
    2240000 0.060052 0.127647 0.121007 0.154747
    3200000 0.081294 0.140325 0.151804 0.173241
    4500000 0.090567 0.158380 0.169017 0.198011
    6400000 0.134527 0.184986 0.221871 0.228213
    9000000 0.139166 0.216010 0.247091 0.265844
    12800000 0.158295 0.244137 0.314549 0.302832
    18000000 0.198464 0.275680 0.342068 0.336601
    25000000 0.232103 0.296431 0.425518 0.369543
    35000000 0.249135 0.318954 0.431165 0.404154
    50000000 0.291809 0.342103 0.537924 0.441044
    70000000 0.312596 0.366850 0.548441 0.473681
    100000000 0.366150 0.390468 0.672164 0.511968
    140000000 0.365590 0.418052 0.684060 0.547667
    200000000 0.414727 0.453255 0.814007 0.592588
    280000000 0.470445 0.476864 0.841394 0.642071
    400000000 0.497190 0.511605 0.975004 0.702385
    560000000 0.524168 0.545820 1.006532 0.771940
    800000000 0.555521 0.577105 1.178287 0.857243
    900000000 0.587890 0.595592 1.197315 0.881594
    1000000000 0.605113 0.605211 1.221586 0.901517
    1100000000 0.582061 0.632543 1.228431 0.939851
    1200000000 0.623949 0.641601 1.262782 0.953194
    1300000000 0.653287 0.649376 1.283567 0.971780
    1400000000 0.758616 0.651059 1.363853 0.990429
    1500000000 0.675992 0.656111 1.387015 1.009316
    1600000000 0.686154 0.680118 1.424667 1.025686
    1700000000 0.692339 0.693832 1.428070 1.036490
    1800000000 0.703114 0.694311 1.440372 1.054245
    1900000000 0.693015 0.690148 1.446149 1.063155
    2000000000 0.688864 0.714125 1.468993 1.088992

    And here is Coffee Lake
    Xeon E-2176G
    veclen branchless branchy
    100000 0.047010 0.091787
    140000 0.045510 0.097623
    200000 0.055983 0.104018
    280000 0.055245 0.109927
    400000 0.067691 0.116034
    560000 0.070107 0.121319
    800000 0.082146 0.127841
    1120000 0.086229 0.138181
    1600000 0.111000 0.153429
    2240000 0.120780 0.175006
    3200000 0.151256 0.199086
    4500000 0.164724 0.223560
    6400000 0.202989 0.244470
    9000000 0.215630 0.268854
    12800000 0.268318 0.290898
    18000000 0.281772 0.315263
    25000000 0.340269 0.336805
    35000000 0.350421 0.360242
    50000000 0.426758 0.385225
    70000000 0.432946 0.411350
    100000000 0.509785 0.441351
    140000000 0.528119 0.472827
    200000000 0.618676 0.510482
    280000000 0.643984 0.548418
    400000000 0.747189 0.590660
    560000000 0.782344 0.628088
    800000000 0.897024 0.671247
    900000000 0.911065 0.682892
    1000000000 0.930987 0.692015
    1100000000 0.920161 0.708865
    1200000000 0.940487 0.718227
    1300000000 0.966389 0.728277
    1400000000 0.984855 0.737065
    1500000000 1.025050 0.745294
    1600000000 1.048256 0.756237
    1700000000 1.064021 0.761571
    1800000000 1.063617 0.765608
    1900000000 1.068122 0.773536
    2000000000 1.074109 0.778483


    My current theory for why the crossover is so late is twofold:

    1) branchy performs a lot of misspeculated accesses, which reduce
    the cache hit rate (and increase the cache miss rate) already at
    relatively low npoints (as shown in <[email protected]>), and likely also
    for low veclen with high npoints. E.g., already npoints=4000 has >2M
    fills from memory on the 8700G, and npoints=500 has >2M
    LLC-load-misses on the 1135G7, whereas the corresponding numbers for
    branchless are npoints=16000 and npoints=4000. This means that
    branchy does not just suffer from the misprediction penalty, but also
    from fewer cache hits for the middle stages of the search. How strong
    this effect is depends on the memory subsystem of the CPU core.

    2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
    overtaking branchless, but it has to first compensate for the slowness
    in the earlier levels before it can reach crossover. That's why we
    see the crossover point at significantly larger sizes than L3.

    On Tiger Lake the ratio is smaller than on the others, probably
    because of Intel's slow "mobile" memory controller.

    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.


    Do you mean that dealing with two different types of memory is an
    objectively hard job?
    Or that the latency of the 1135G7 is not too bad?
    I can agree with the former, but not with the latter.
    "Branchless" results for veclen in the 2M-10M range, which should be
    a good indication of main RAM latency, are just not good on this
    gear. And the slope of the duration vs log(veclen) graph taken over
    this range, which is probably an even better indication of memory
    latency, is also not good.
    It all looks like memory latency on this TGL is ~90ns, much higher
    than the rest of them. That is, except for Gracemont, where the very
    high measured latency is probably caused by some sort of power-saving
    policy and not related to the memory controller itself.

    So far I don't see how your measurements disprove my theory of the
    page table not fitting in the L2 cache.

    I did not try to disprove that; if I did, I would try to use huge
    pages (the 1G ones if possible) and see how that changes the results.


    That sounds like an excellent idea.

    But somehow I fail to see why the page table walks should make much of
    a difference.

    If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

    Doesn't SW prefetch turn itself into a NOP on a TLB miss?
    I remember reading something of that sort, but don't remember which
    CPU was being discussed.
    Unfortunately, documentation rarely provides this sort of detail.





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Wed Feb 25 18:32:25 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Tue, 24 Feb 2026 17:30:31 GMT
    [email protected] (Anton Ertl) wrote:
    This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
    separate DIMM, so the memory controller may be faultless.


    Do you mean that dealing with two different types of memory is an
    objectively hard job?

    I think that this setup is likely to provide less memory-level
    parallelism, because probably 80% of the accesses go to only one DRAM
    channel (on the 32GB DIMM), whereas on the 8700G the accesses
    distribute evenly across 4 channels.

    Or that the latency of the 1135G7 is not too bad?

    I have not measured that before, let's see: Using bplat (pointer
    chasing in a randomized cycle of pointers), I see for the 8700G
    machine with DDR5-5200 and this 1135G7 machine the following
    latencies (in ns):

    size (B) 8700G 1135G7
    1024 0.8 1.2
    2048 0.8 1.2
    4096 0.8 1.2
    8192 0.8 1.2
    16384 0.8 1.2
    32768 0.8 1.2
    65536 2.8 3.3
    131072 2.8 3.4
    262144 2.8 3.4
    524288 3.6 4.2
    1048576 4.5 5.0
    2097152 10.0 11.9
    4194304 10.4 14.6
    8388608 10.3 34.9
    16777216 25.1 72.9
    33554432 77.4 90.4
    67108864 86.8 93.6
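
    As a rough illustration of the bplat methodology (pointer chasing in
    a randomized cycle of cache-line-sized nodes), here is a minimal
    sketch; this is my own reconstruction, not Anton's actual code, and
    all names are mine:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

#define LINE 64   /* one node per cache line */
typedef struct chase_node {
    size_t next;
    char pad[LINE - sizeof(size_t)];
} chase_node;

/* Build n nodes linked into one random cycle: shuffle an index
   permutation (Fisher-Yates), then link the nodes in shuffled order,
   which yields a single cycle visiting every node. */
chase_node *chase_build(size_t n)
{
    chase_node *buf = malloc(n * sizeof *buf);
    size_t *perm = malloc(n * sizeof *perm);
    if (!buf || !perm) { free(buf); free(perm); return NULL; }
    for (size_t i = 0; i < n; i++) perm[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[perm[i]].next = perm[(i + 1) % n];
    free(perm);
    return buf;
}

/* Chase the cycle; each load depends on the previous one, so the
   elapsed time per access approximates the load-to-use latency for
   this working-set size. Returns ns per access. */
double chase_measure(chase_node *buf, size_t iters)
{
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = buf[p].next;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p;   /* keep the chain live */
    (void)sink;
    return ((t1.tv_sec - t0.tv_sec) * 1e9
            + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

    Sweeping the node count from a few KB to beyond L3 size reproduces
    the latency-vs-size staircase shown in the table above.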

    So the main memory latency does not look much worse for the 1135G7
    than for the 8700G. However, in <https://www.complang.tuwien.ac.at/anton/bplat/Results> there are a
    number of machines with significantly better main memory latencies; interestingly, the best result comes from a Rocket Lake machine, a
    close relative of the Tiger Lake in the 1135G7. So you may be right
    about Intel's mobile memory controller configuration.

    It all looks like memory latency on this TGL is ~90ns

    Right on target:-)

    If I find the time, I would like to see how branchless with software
    prefetching performs. And I would like to put all of that, including
    your code, online. Do I have your permission to do so?

    - anton

    Doesn't SW prefetch turn itself into a NOP on a TLB miss?

    I don't know. If the prefetch is for something that will actually be
    fetched, a TLB miss would make the prefetch even more urgent.

    What I expect is that a prefetch to a virtual address that cannot be
    read becomes a noop. Of course, if you are a hardware engineer under
    time pressure, you may take the shortcut of considering the absence
    from the TLB to be an indication of inaccessibility.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Wed Feb 25 17:40:11 2026
    From Newsgroup: comp.arch

    On 2/23/2026 4:17 PM, Paul Clayton wrote:
    On 2/18/26 3:45 PM, BGB wrote:

    snip, not feeling up to responding to everything at the moment.

    So, for now, reevaluating things more narrowly.


    I do not know if 5/6-bit state machines have been academically
    examined for predictor entries. I suspect the extra storage is a
    significant discouragement given one often wants to cover more
    different correlations and branches.


    If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
    counters, it can be a win.

    I suspect it very much depends on whether bias or pattern is
    dominant. This would depend on the workload (Doom?) and the
    table size (and history length). I do not know that anyone in
    academia has explored this, so I think you should be proud of
    your discovery even if it has limited application.

    A larger table (longer history) can mean longer training, but
    such also discovers more patterns and longer patterns (e.g.,
    predicting a fixed loop count). However, correlation strength
    tends to decrease with increasing distance (having multiple
    history lengths and hashings helps to find the right history).


    Yeah.

    I just decided to go off and do some slightly more extensive testing.


    As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.

    When the pattern is exactly repeated this is great, but if the
    correlation with global history is fuzzy (but biased) a counter
    might be better.

    I get the impression that branch prediction is complicated
    enough that even experts only have a gist of what is actually
    happening, i.e., there is a lot of craft and experimentation and
    less logical derivation (I sense).


    I decided to try comparing a few options by sampling branch data and
    then testing against it offline.

    Though, the current empirically-sampled data is limited in scope
    (currently covering running the Doom game loop, and with enough
    sample-points to cover roughly 2 seconds of running time).

    Even for a 50 MHz CPU, sampling more than a few seconds worth of
    branches would eat a significant chunk of RAM and make for a fairly
    large file.


    Recording the startup process, launching Doom, and running part of the
    game loop, while preferable, will take up too much space (if one is
    recording both the PC address and branch direction).

    In some other past contexts, it looked like the 5/6 bit FSM was the
    clear winner. But, with Doom samples and offline testing, not so clear.


    Here, context is the low-order bits of PC XORed with the recent branch history.
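
    A sketch of that index computation (the helper names and the PC
    shift are my assumptions, not necessarily BGB's exact hashing):

```c
#define CTX_BITS 6   /* table of 1 << CTX_BITS predictor entries */

/* Index = low PC bits XOR global branch history, masked to the table
   size. The >> 2 assumes 4-byte-aligned instruction addresses. */
unsigned ctx_index(unsigned pc, unsigned ghist)
{
    return ((pc >> 2) ^ ghist) & ((1u << CTX_BITS) - 1);
}

/* Shift the just-resolved branch direction into the history. */
unsigned ghist_update(unsigned ghist, int taken)
{
    return (ghist << 1) | (taken & 1);
}
```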


    Miss rate from the collected data (with a 6-bit context):
    1-bit, last bit : ~ 19.14%
    2-bit, sat counter : ~ 14.87% (2nd place)
    3-bit, sat counter : ~ 18.24%
    5/6-bit, GA evolved : ~ 14.25% (winner)

    3/4-bit, alternating (hand): ~ 35.95% (bad)
    4/5-bit, 2/3 bit pat (hand): ~ 38.75% (bad)
    5/6 (hand) : Untested (too much effort)
    (Will leave out the hand-written FSMs as they seem to suck).
    ...

    If context is reduced to 5 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 17.37% (2nd place)
    3-bit, sat counter : ~ 21.69%
    5/6-bit FSM : ~ 15.86% (winner)

    If context is reduced to 4 bits:
    1-bit, last bit : ~ 30.55%
    2-bit, sat counter : ~ 22.15% (2nd place)
    3-bit, sat counter : ~ 29.04%
    5/6-bit FSM : ~ 19.25% (winner)


    If context is increased to 8 bits:
    1-bit, last bit : ~ 13.07%
    2-bit, sat counter : ~ 12.32% (winner)
    3-bit, sat counter : ~ 13.96%
    5/6-bit FSM : ~ 12.79% (2nd place)

    If context is increased to 12 bits:
    1-bit, last bit : ~ 9.46% (winner)
    2-bit, sat counter : ~ 9.85% (2nd place)
    3-bit, sat counter : ~ 10.38%
    5/6-bit FSM : ~ 10.70%


    So, it seems increasing the context size causes the FSM to lose its
    advantage (mostly as more of the context seems to be dominated by
    single-bit patterns; and accuracy based on how quickly it can adapt).

    The 3-bit saturating counter here is nearly always worse than the 2-bit saturating counter.


    If optimizing for "most effectiveness per bit of storage":
    8-bit context + 2-bit sat counter would win;
    If optimizing for FPGA resource cost: 5 and 6 bit contexts would win.



    Did go and modify the GA table evolver to also use the sampled data as a reference for the initial evolver step (along with the original purely synthetic patterns). This appears to have maybe helped.

    Slightly faster adaptation and noise tolerance; but margins seem small
    (can't rule out the possibility of the difference being due to
    RNG/"butterfly effect" or similar, rather than due to training on "real
    world" data vs purely synthetic patterns).


    With 6-bit context, sampled data vs pure synthetic:
    5/6-bit, GA evolved (real) : ~ 14.25% (winner)
    5/6-bit, GA evolved (synth): ~ 15.33%

    Where, the inclusion of real sampled data in the training process does
    appear to improve accuracy over a version trained purely on synthetic patterns.



    Using a genetic algorithm to generate the tables seems to vastly outperform my ability to naively hand-fill the tables.

    I suspect this is because the GA is able to generate tables which adapt
    to changing patterns more quickly.


    The hand-filled tables tend to need to get back to "hub states"
    which then migrate to the target state for the pattern. The
    GA-generated tables seem able to side-step the need for hub states
    and adapt to a change in pattern within around a few bits (if
    incrementing the bit-pattern, it seems to have a 2-bit lead-in).


    But, the need for hub states does put the hand-filled patterns at a
    disadvantage relative to the saturating counters, which, as noted,
    does not apply to whatever the genetic algorithm is doing.

    So, it seems like the hand-written tables are at a serious disadvantage
    here.


    I did verify that it was still able to adapt to arbitrary repeating 3
    and 4 bit patterns (and accurately express these patterns).

    With the main variability being that the GA-evolved patterns seem to
    adapt more quickly and are also better able to tolerate "noise".


    But, alas, this doesn't exactly mean that a 5/6 bit FSM is a clear
    winner over a 2-bit saturating counter in the general case.

    Comparing against branch values from initial boot process (6b ctx):
    1-bit, last bit : ~ 11.39%
    2-bit, sat counter : ~ 6.48%
    3-bit, sat counter : ~ 33.86%
    5/6-bit, GA evolved (real) : ~ 4.90%
    5/6-bit, GA evolved (synth): ~ 5.02%


    Granted, this was the context that originally led me to choose the 5/6
    bit FSM as the clear winner over the 2-bit saturating counter.


    And, for reasons, I mostly ended up not using hand-written FSMs in
    any of these cases, as I don't understand whatever black arts the GA
    is using here, nor can I replicate the same level of performance
    (also, even if I have since figured out how to do so, writing out
    such an FSM by hand is still a PITA).

    So, alas, the use of a genetic algorithm seems superior at this task...


    In this case, a C version of the 5/6-bit FSM table looks like:
    static byte fsmtab_5b[64]={
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
    };

    With the input/output data bit in the LSB (other 5 bits being state).
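
    One plausible way to drive such a table, under my reading of that
    LSB convention (the helper names and the exact state/entry layout
    here are assumptions, not a verified harness; the table values are
    copied from above):

```c
typedef unsigned char byte;

/* GA-evolved table copied verbatim from the post above. */
static byte fsmtab_5b[64] = {
    0x32, 0x20, 0x3E, 0x3B, 0x18, 0x02, 0x31, 0x13,
    0x0C, 0x10, 0x06, 0x16, 0x1E, 0x24, 0x06, 0x12,
    0x0F, 0x0B, 0x01, 0x2B, 0x34, 0x19, 0x1D, 0x2C,
    0x33, 0x37, 0x26, 0x1B, 0x0D, 0x11, 0x08, 0x03,
    0x0D, 0x39, 0x3D, 0x2A, 0x01, 0x13, 0x26, 0x37,
    0x34, 0x36, 0x3F, 0x2D, 0x15, 0x21, 0x0E, 0x2B,
    0x29, 0x2E, 0x38, 0x04, 0x0C, 0x25, 0x00, 0x17,
    0x30, 0x2F, 0x09, 0x03, 0x14, 0x05, 0x3D, 0x23,
};

/* The predictor state is a 6-bit value whose LSB is the current
   prediction and whose upper 5 bits are the FSM state. */
static int fsm_predict(byte state)
{
    return state & 1;
}

/* On branch resolution, index the table with the 5 state bits and
   the actual outcome substituted into the LSB. */
static byte fsm_update(byte state, int taken)
{
    return fsmtab_5b[(state & 0x3E) | (taken & 1)];
}
```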


    ...


    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Thu Feb 26 09:08:29 2026
    From Newsgroup: comp.arch

    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a
    track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Feb 26 14:54:07 2026
    From Newsgroup: comp.arch

    On 2/22/2026 3:52 PM, John Dallman wrote:
    In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:

    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

    Educators who are any good notice the weird kids who are actually smart.


    Sometimes I question if I really am though.

    Like, some evidence says I am, but by most metrics of "life success" I
    have done rather poorly.


    And, in middle and high-school, they just sorta forced me to sit through normal classes (which sucked really hard). Well, and I apparently missed
    the point of school, thinking it was more of an endurance thing with
    sort of a vague pretense of education (and I probably would have learned
    more if they just let me spend the time doing whatever else).

    ...



    But, it seems like a case of:
    By implication, I am smart, because if I wasn't, even my own (sometimes pointless) hobby interests would have been out of reach.

    Like, not a world of difficulty justifying them, or debating whether or
    not something is worth doing, but likely not something someone could do
    at all.


    Or, maybe, like encountering things that seem confusing isn't such a
    rare experience (or that people have learned how to deal more
    productively with things they can see but don't understand?...).


    But, there is a thing I have noted:
    I had a few times mentioned to people about finding that certain AIs had gotten smart enough to start understanding how a 5/6 bit finite state
    machine to predict repeating 1-4 bit patterns would be constructed.

    Then, I try to describe it, and then realize that for the people I try
    to mention it to, it isn't that they have difficulty imagining how one
    would go about filling in the table and getting all of the 4 bit
    patterns to fit into 32 possible states. Many seem to have difficulty understanding how such a finite state machine would operate in the first place.


    Even though it seems like this part is something that pretty much
    anyone should be able to understand.

    Initially, I had used this as a test case for the AIs because it posed "moderate difficulty" for problems which could be reasonably completely described in a chat prompt (and is not overly generic).

    Never mind that it is still a pain to generate tables by hand, and my
    attempts at hand-generated tables have tended to have worse
    adaptation rates than those generated using genetic algorithms (they
    can be cleaner looking, but tend to need more input bits to reach the
    target state if the pattern changes).


    Sometimes I feel like a poser.
    Other things, it seems, I had taken for granted.

    Seems sometimes if I were "actually smart", would have figured out some
    way to make better and more efficient use of my span of existence.


    For 128 predicate registers, this part doesn't make as much sense:

    I suspect they wanted to re-use some logic.

    The tricks Itanium could do with combinations of predicate registers were pretty weird. There was at least one instruction for manipulating them
    which I was entirely unable to understand, with the manual in front of me
    and pencil and paper to try examples. Fortunately, it never occurred in
    code generated by any of the compilers I used.


    Possibly.

    I had also looked into a more limited set of predicate registers at one
    point, but this fizzled in favor of just using GPRs.

    So, as noted:
    I have 1 predicate bit (T bit);
    Had looked into expanding it to 2 predicate bits (using an S bit as a
    second predicate), but this went nowhere.


    Had at another time looked into schemes for having a combination of 8x
    1-bit predicate registers with operations that could update the T bit.
    My initial attempt was an x87 style stack machine, and this was a fail.
    A later design attempt would have added U0..U7 as 8x 1-bit registers.

    Though, just ended up instead going with GPRs for this (following a
    pattern more like RISC-V). Though, in XG3, some operations can be
    directed at R0/X0 to update the T bit.


    In RV-like terms:
    SLT, SGE, SEQ, SNE, SLTU, SGEU
    AND, OR //more recent
    Where, 'AND' partly takes over the role of the 2R "TST" instruction.
    AND X0, X10, X11

    Though, for now using AND/OR directed to X0 for bitwise predication will
    be specific to XG3 encodings.

    Say, because someone in their great wisdom decided to use ORI and
    similar directed to X0 in the RISC-V encoding space to encode the
    prefetch instructions.

    Personally, I would have used, say:
    LB X0, Disp(Xs)
    Or similar, since presumably any sane prefetch needs to be able to
    access the memory it is prefetching from, and load-as-prefetch makes
    more sense to me than ORI as prefetch, but alas...

    Then again, LHU/LWU encoding with X0 as an implicit branch for an
    optional feature is similarly suspect (and carries the risk of "what if someone else puts some other behavior here"?...).


    *1: Where people argue that if each vendor can do a CPU with their
    own custom ISA variants and without needing to license or get
    approval from a central authority, that invariably everything would
    decay into an incoherent mess where there is no binary
    compatibility between processors from different vendors (usual
    implication being that people are then better off staying within
    the ARM ecosystem to avoid RV's lawlessness).

    The importance of binary compatibility is very much dependent on the
    market sector you're addressing. It's absolutely vital for consumer apps
    and games. It's much less important for current "AI" where each vendor
    has their own software stack anyway. RISC-V seems to be far more
    interested in the latter at present.


    Probably true...


    Likely also the space of customized CPU design / experimentation is much
    more accepting of fragmentation, where in more mainline "user oriented" hardware, it would be a bigger issue.

    Still, I sit around waiting to see if the whole RISC-V indexed load thing (zilx) becomes an actual extension.

    In my working version, did end up going and implementing support for it
    within BGBCC (and in my emulator and CPU core), but am still partly
    waiting on whether it gains actual approval from the ARC.

    Most recent news I saw was basically one of the people involved
    complaining that it would show no significant performance benefit for
    SPEC running on OoO processor implementations.

    Say, vs Doom on an in-order CPU, where it makes a much bigger difference.

    Sometimes, maybe, SPEC on high-end CPUs should not be the primary
    arbiter (more so when most of the CPUs are likely to end up in segments
    where in-order tends to dominate).

    ...

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Thu Feb 26 14:54:53 2026
    From Newsgroup: comp.arch

    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems. For example, imagine a company employee file (table),
    with a (say 300 byte) record for each of its many thousands of employees
    each containing typical employee stuff). Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department. With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record. This it almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem. If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips. Seemingly
    about the only real hope the CPU has is to have a large cache and just
    hope that the data happens to be in the cache (and has been accessed previously or sufficiently recently) else it is just kinda SOL.

    If there is some way that CPU's can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of this personally.

    Rather, as can be noted, memory access patterns can often make a fairly
    large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
    PNG decoding typically processes the image as several major phases:
    Decompress the Deflate-compressed buffer into memory;
    Walk over the image, running scanline filters,
    copying scanlines into a new (output) buffer.

    Even if the parts, taken in isolation, should be fast:
    The image buffers are frequently too large to fit in cache;
    Cache misses tend to make PNG decoding painfully slow,
    even when using faster filters.
    If using the Paeth filter though, this adds extra slowness,
    due to branch-predictor misses.
    On targets like x86,
    the filter is frequently implemented using branches;
    The branch miss rate is very high.
    So, a naive branching version performs like dog crap.

    So, net result: Despite its conceptual simplicity, PNG's decode-time performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It is then faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch-mispredict costs.
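
    For concreteness, here is one common shape of that transformation,
    sketched for the PNG Paeth predictor (a generic branchless rewrite
    of the spec's definition, not any particular decoder's code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Reference (branchy) Paeth predictor, as defined in the PNG spec:
   a = left, b = above, c = upper-left. */
static uint8_t paeth_ref(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Branchless variant: turn the comparisons into all-zero/all-one
   masks and select with AND/OR, so the hard-to-predict branches on
   pixel data disappear (compilers typically lower the remaining
   comparisons to flag-setting ops or cmov). */
static uint8_t paeth_branchless(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    int ma = -(pa <= pb && pa <= pc);  /* -1 if a wins, else 0 */
    int mb = -(pb <= pc) & ~ma;        /* -1 if b wins, else 0 */
    int mc = ~(ma | mb);               /* -1 if c wins, else 0 */
    return (uint8_t)((a & ma) | (b & mb) | (c & mc));
}
```

    The two agree on all inputs, including the spec's tie-breaking
    order (a, then b, then c).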

    ...

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Fri Feb 27 09:52:46 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound. >>
    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically >increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand, >doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to
    increase the MLP. Concerning a reduction in latency, that would not
    increase performance, because this application is already working at
    the bandwidth limit.

    I feel the urge to write up a mock variant of your use case and
    measure whether reality confirms my expectations, but currently have
    no time for that.

    But let's take a slightly simpler case for which I already have a
    program: Walk through a 32MB (not MiB in this case) array on a Zen4
    with 16MB L3 linearly with different strides, repeatedly with 100M
    total accesses; the loads in this case are dependent (i.e., pointer
    chasing), but at least for some of the strides that may not matter,
    because the bandwidth limits the performance:

    for i in 128 64 56 48 40 32; do LC_NUMERIC=prog perf stat -e cycles:u ./memory1 linear $[32000000/$i] $i; done

    The resulting cycles for 100M memory accesses are:

    stride cycles:u s (user) bandwidth
    128 3_338_916_538 0.666553000 9.6B/ns
    64 480_517_794 0.100506000 63.7B/ns
    56 457_623_521 0.094127000 59.5B/ns
    48 443_545_135 0.086447000 55.5B/ns
    40 436_435_599 0.086920000 46.0B/ns
    32 427_891_293 0.089242000 35.9B/ns

    The bandwidth is computed based on 64B cache lines, so with stride up
    to 64, stride*100M bytes are accessed, while with stride 128,
    64*100M bytes are accessed.
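
    That bookkeeping can be written out directly (a small helper of my
    own for cross-checking the table above):

```c
/* With 64B cache lines, each access brings in min(stride, 64) bytes
   of new data; bandwidth = accesses * that amount / elapsed time. */
double bandwidth_bytes_per_ns(long accesses, long stride, double seconds)
{
    long per_access = stride <= 64 ? stride : 64;
    return (double)accesses * (double)per_access / (seconds * 1e9);
}
```

    E.g., 100M accesses at stride 64 in 0.100506s give ~63.7B/ns,
    matching the table.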

    The limits for this hardware are 64B/ns bandwidth (from the
    interconnect between the CCX and the rest of the system), and 4 cycles
    minimum load latency (resulting in at least 400M cycles for the 100M
    dependent memory accesses). Ideally one would see something like (at
    5GHz):

    stride cycles:u s (user) bandwidth
    128 500_000_000 0.1 64B/ns
    64 500_000_000 0.1 64B/ns
    56 437_500_000 0.0875 64B/ns
    48 400_000_000 0.08 60B/ns
    40 400_000_000 0.08 50B/ns
    32 400_000_000 0.08 40B/ns

    In practice we see values close to this only for stride=64. For
    smaller strides, the cycles and seconds are a little higher, and the
    bandwidth a little lower.

    For stride=128, the results are much worse. Apparently the memory
    controllers are very good when the accessed lines are successive, but
    much worse when there are holes between the accesses.

    For your scenario with 300B records and 1-2 accesses inside them, we
    will probably see similar slowdowns as for the stride=128 case. I
    have no time for trying this benchmark with only one DIMM (which would
    reduce the maximum bandwidth to about 41GB/s), but given the actual
    observed bandwidth, I doubt that this would change much for the
    stride=128 case.

    Concerning reducing the latency, that obviously would not help in the
    stride=64 case (it is at the bandwidth limit), but for stride=128 it
    might help. Working with a 12MB working set gives:

    stride cycles:u s (user) bandwidth
    192 710_625_989 0.146530000 43.7B/ns
    128 733_132_291 0.154726446 41.4B/ns
    64 425_149_972 0.094751987 67.5B/ns
    56 417_703_739 0.093068039 60.2B/ns
    48 415_704_043 0.092887846 51.7B/ns
    40 415_855_110 0.091790911 43.6B/ns
    32 406_409_135 0.091095578 35.1B/ns

So even with only L3 involved, stride>64 sees a slowdown, although a
much smaller one. The stride <= 64 cases are relatively close to
    the limit of 400M cycles, but the clock speed is relatively low (maybe
    due to the power limit of 25W).

    Using independent instead of dependent accesses would make this
    benchmark much closer to your use case, and would make it possible to
    use MLP for demand loads rather than just for hardware prefetchers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 27 18:55:23 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree:

Many times, increasing memory BW causes an O(log(BW)) increase in memory latency. For example, when a supercomputer multiplies the number of memory banks, it adds a clock of latency between the CPUs and memory, and adds
a second clock of delay on the way back. 16× the memory banks would add another 2 clocks.

If adding banks causes any degradation in performance then, by
definition, the application losing performance is latency bound.

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.



    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 27 19:04:42 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated? Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

It is probably useful to distinguish between latency bound and bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can help.

Many occur in commercial (i.e. non-scientific) programs, such as
database systems. For example, imagine a company employee file
(table), with a (say 300 byte) record for each of its many thousands
of employees, each containing typical employee stuff. Now suppose
someone wants to know "What is the total salary of all the employees
in the 'Sales' department?" With no index on "department", but with it
at a fixed displacement within each record, the code looks at each
record, does a trivial test on it, perhaps adds to a register, then
goes to the next record. This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a bandwidth-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips.

    There is no ILP when you are sitting waiting on memory.

Seemingly about the only real hope the CPU has is to have a large
cache and just hope that the data happens to be in the cache (has been
accessed sufficiently recently), else it is just kinda SOL.

If there is some way that CPUs can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this
personally.

    I built such a memory system (circa 2003) and it worked wonderfully
on the first 1B cycles, when the application was building its data set for
    the first time. Later, as the DAG was manipulated, there were no
    predictors that helped <much>.

    Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.

Which is why serious numerics code is written for specific transpose
orders. See DGEMM as an example.

Like, for example, decoding a PNG-like format vs a JPG-like format:
PNG decoding typically processes the image as several major phases:
Decompress the Deflate-compressed buffer into memory;
Walk over the image, running scanline filters;
Copy scanlines into a new (output) buffer.

    FFT has the property that, sooner or later, every next fetch of the
    matrix entry takes a cache miss. Sometimes this is at the beginning
    (Decimation in time) sometimes at the end (Decimation in frequency)
    sometimes in the middle (access pattern is congruent to cache size).

With matrices of just the right size, one can achieve a TLB miss on
    every 8th access.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Feb 27 19:27:21 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 2/22/2026 3:52 PM, John Dallman wrote:
    In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:

    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

    Educators who are any good notice the weird kids who are actually smart.


    Sometimes I question if I really am though.

    Like, some evidence says I am, but by most metrics of "life success" I
    have done rather poorly.


    And, in middle and high-school, they just sorta forced me to sit through normal classes (which sucked really hard)

    In my case, I remember sitting in the back of advanced algebra class
    (mostly senior HS people, me a sophomore) doing chemistry homework while vaguely listening to the teacher fail to get various students to solve
    a typical algebra problem. Then she called on me, I looked up at the board
    and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral: don't be bored in class, do something useful instead.

    Well, and I apparently missed
    the point of school, thinking it was more of an endurance thing with
    sort of a vague pretense of education (and I probably would have learned more if they just let me spend the time doing whatever else).

    For most people, school attempts to give the students just enough knowledge that they are not burdens on society.
    -------------------------
    The tricks Itanium could do with combinations of predicate registers were pretty weird. There was at least one instruction for manipulating them which I was entirely unable to understand, with the manual in front of me and pencil and paper to try examples. Fortunately, it never occurred in code generated by any of the compilers I used.

    It could have been a case where the obvious logic decoding "that" field in
    the instruction allowed for "a certain pattern" to perform what they described in the spec. I did some of this in Mc 88100, and this is what taught me never to do it again or allow anyone else to do it again.

    Possibly.

    I had also looked into a more limited set of predicate registers at one point, but this fizzled in favor of just using GPRs.

    So, as noted:
    I have 1 predicate bit (T bit);
    Had looked into expanding it to 2 predicate bits (using an S bit as a
    second predicate), but this went nowhere.

    I have tried several organizations over the last 40 years of practice::
    In my Humble and Honest Opinion, the only constructs predicates should
support are singular comparisons and comparisons using && and || with De Morgan-izing logic {~}--not because other forms are unuseful, but
because those are the constructs programmers use when writing code.

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Fri Feb 27 19:31:41 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> schrieb:

With matrices of just the right size, one can achieve a TLB miss on
    every 8th access.

    Which is why people copy parts of the matrices they multiply
    into separate blocks. If the sizes fit the cache hierarchy,
    it is an excellent tradeoff even though the number operations
    nominally increases.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Fri Feb 27 19:57:45 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:



    And, in middle and high-school, they just sorta forced me to sit through
    normal classes (which sucked really hard)

    In my case, I remember sitting in the back of advanced algebra class
    (mostly senior HS people, me a sophomore) doing chemistry homework while >vaguely listening to the teacher fail to get various students to solve
    a typical algebra problem. Then she called on me, I looked up at the board >and in less than a second I rattled off the answer skipping 5 steps along
    the way. Moral, don't be bored in class, do something useful instead.

    Well, and I apparently missed
    the point of school, thinking it was more of an endurance thing with
    sort of a vague pretense of education (and I probably would have learned
    more if they just let me spend the time doing whatever else).

    For most people, school attempts to give the students just enough knowledge >that they are not burdens on society.

    My high school (1970s, when the split was K-7, 7-9, 10-12) had
    four "communities".

    Traditional
    Career
    Work Study
    Flexible Individual Learning (FIL)

    The college-bound were generally part of the
    FIL community. Career included business classes,
    traditional was more like the olden days and
    Work Study included off-school apprenticeships,
    shop classes, electronics training, etc.

    Students mostly took classes with peers in their
    community (there were over 400 in my graduating class).

    Worked rather well, but ended up segregating students
    by income level as well as IQ, so
    the school district changed that in the
    80s in the interest of equality treating the
    entire high school as a single community. The
    quality of the education received diminished
    thereafter, IMO.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Feb 27 16:14:17 2026
    From Newsgroup: comp.arch

    On 2/27/2026 1:57 PM, Scott Lurndal wrote:
    MitchAlsup <[email protected]d> writes:



And, in middle and high-school, they just sorta forced me to sit through
normal classes (which sucked really hard)

In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while
vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board
and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral: don't be bored in class, do something useful instead.

Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned
more if they just let me spend the time doing whatever else).

For most people, school attempts to give the students just enough knowledge
that they are not burdens on society.

    My high school (1970s, when the split was K-7, 7-9, 10-12) had
    four "communities".

    Traditional
    Career
    Work Study
    Flexible Individual Learning (FIL)

    The college-bound were generally part of the
    FIL community. Career included business classes,
    traditional was more like the olden days and
    Work Study included off-school apprenticeships,
    shop classes, electronics training, etc.

    Students mostly took classes with peers in their
    community (there were over 400 in my graduating class).

    Worked rather well, but ended up segregating students
    by income level as well as IQ, so
    the school district changed that in the
    80s in the interest of equality treating the
    entire high school as a single community. The
    quality of the education received diminished
    thereafter, IMO.

AFAIK, the high schools I went to in the early 2000s had 2 groups:
    Normal;
    Special Education.

    I think initially there would have been some "AP" classes, but these
    were eliminated because of "No Child Left Behind" or similar (easier to
    fold everyone into the same classes for sake of standardized testing).

    ...


    Well, they also had other things going on at the time:
    Entering the building involved a checkpoint and showing an ID (to be
    scanned with a handheld barcode reader, *1);
    The building was typically partitioned off with metal gates and checkpoints;
    At certain times they would open the gates to allow freer movement,
    other times one would need to show ID (which would be logged) and let
    through using a smaller gate;
During classes, guards would typically also patrol the halls along with
dogs; if one were in the hall during class and ran into one, they
would need to show ID and a hall pass (a printed ticket identifying
the class, teacher, date/time issued, etc., sorta like a receipt one
gets in a store; one needed to ask the teacher to leave the room, and
the teachers' computers would often have receipt printers, as opposed
to, say, using a laser printer to print a hall pass, *2);
    ...

    *1: Back in these days, barcodes were still the go-to technology, not
    having yet been replaced by the use of QR codes and similar.

*2: I guess it depended on how the cost dynamics worked out between using the
laser printer (and a full sheet of paper) vs also giving each teacher a
thermal printer in addition to a laser printer (for the off-chance of
students needing to use the bathroom or similar?...).



    Note that getting to/from the school generally involved the use of
    school busses (and, at the end of the day, the goal was mostly to make
    it out of the building and onto the correct bus before the bus leaves;
    note that there was generally no time to stop or loiter, or one would
    miss the bus).

    Or, if the teacher delayed dismissing everyone at the final bell, one
    could also miss the bus.

    ...


    Had noticed that some of this was typically lacking in TV show
    depictions of high-schools, which often show people moving freely and socializing; and not so much the use of guards and checkpoints (or flows
    of students each along their respective side of the hall, and needing to
    weave through the crowd at intersections, where the flow would become
    more turbulent).

    Well, and say, one needing to try to make it efficiently through the
    halls as to not be late for the next class (where, say, hallway crowding
    would sometimes make it difficult to cross the building within the 5
    minute time limit).


Well, one could also maybe try to stop by the bathroom between classes,
but doing so would greatly increase the likelihood of being late; it
was a tradeoff between tardiness and needing to bother the teacher
for a pass to use the bathroom, so "lesser of two evils" or such.


    ...


    Not sure what modern high-schools are like though.


    Or such...

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Feb 27 17:01:22 2026
    From Newsgroup: comp.arch

    On 2/27/2026 1:27 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/22/2026 3:52 PM, John Dallman wrote:
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

Educators who are any good notice the weird kids who are actually smart.

    Sometimes I question if I really am though.

    Like, some evidence says I am, but by most metrics of "life success" I
    have done rather poorly.


    And, in middle and high-school, they just sorta forced me to sit through
    normal classes (which sucked really hard)

    In my case, I remember sitting in the back of advanced algebra class
    (mostly senior HS people, me a sophomore) doing chemistry homework while vaguely listening to the teacher fail to get various students to solve
    a typical algebra problem. Then she called on me, I looked up at the board and in less than a second I rattled off the answer skipping 5 steps along
    the way. Moral, don't be bored in class, do something useful instead.


    I didn't really do much in terms of assignments.


    No one really called on me either (but, IME, calling on students to
    answer questions wasn't really a thing).

    Mostly things were pretty one way.

    I think at one point, there was a thing in one class where things got
    heated between the teacher and one of the students, like I think they
were getting in an argument about GWB's invasion of Afghanistan or
    something (and whether or not the invasion was justified or similar);
    she sent him to the office, and principal sent him back.

    Usually, expected role was to sit passively, do assignments as assigned,
    and say nothing.


    I think, there was another time where a science teacher was talking
    about stuff in class, and started getting agitated, and deviated from
    the contents in the textbook, expressing her disagreement with
    naturalistic evolution and started going on about intelligent design and similar.

    At the time, I wasn't entirely sure what to make of this, she was
    putting her job on the line by doing this (I am not sure what happened
    with her after this).

    Like, it wasn't usually a thing that the teachers would go against the textbook.


    I didn't do much at the time, I think at the time I didn't expect that I
    would still be around this far into the future (decades later).


    Well, and I apparently missed
    the point of school, thinking it was more of an endurance thing with
    sort of a vague pretense of education (and I probably would have learned
    more if they just let me spend the time doing whatever else).

    For most people, school attempts to give the students just enough knowledge that they are not burdens on society.
    -------------------------

    Probably.

    I think the general assumption at the time was that people would either
    go on to entry-level jobs, or some would go on to college.

    Well, and then find that none of these jobs really wanted to hire anyone.

    Like, stores aren't going to hire more people to work the registers if
    they already have enough people working the registers. Well, or the
    people who went to do inventory or warehouse jobs, etc.


The tricks Itanium could do with combinations of predicate registers were
pretty weird. There was at least one instruction for manipulating them
which I was entirely unable to understand, with the manual in front of me
and pencil and paper to try examples. Fortunately, it never occurred in
code generated by any of the compilers I used.

    It could have been a case where the obvious logic decoding "that" field in the instruction allowed for "a certain pattern" to perform what they described
    in the spec. I did some of this in Mc 88100, and this is what taught me never to do it again or allow anyone else to do it again.


    I haven't looked all that deeply into IA-64 predicate handling, partly
    as I had done it in a different way.


    Possibly.

    I had also looked into a more limited set of predicate registers at one
    point, but this fizzled in favor of just using GPRs.

    So, as noted:
    I have 1 predicate bit (T bit);
    Had looked into expanding it to 2 predicate bits (using an S bit as a
    second predicate), but this went nowhere.

    I have tried several organizations over the last 40 years of practice::
    In my Humble and Honest Opinion, the only constructs predicates should support are singular comparisons and comparisons using && and || with deMoganizing logic {~}--not because other forms are unuseful, but be-
    cause those are the constructs programmers use writing code.



    In this case, the pattern could have been expanded:
    OP //unconditional
    OP?T //T is Set
    OP?F //T is Clear
    OP?ST //S is Set
    OP?SF //S is Clear

    Analysis benefit of S bit here? None.
    The only notable nominal benefit of multiple predicate bits would be in
    the 1 true / 1 false case, but this is already handled by the ?T / ?F
    scheme, which (unlike IA-64) would not need multiple predicate bits for
    the THEN and ELSE branch.

    I considered U bits, but this went nowhere.
    These could have been U0..U7 as 1 bit flags, but would still need an
    operation to direct them into T to use for actual predication.


    Even if U-bit instructions were added, they couldn't save much over, say
    (RV like notation for XG3, *1):
    SGE X10, X18, 1
SLT X11, X18, 10
    AND X0, X10, X11 //T=(X10&X11)!=0
    OP?T ...
    OP?F ...
    ...
    For:
    if((x>0) && (x<10))
    { ... }
    else
    { ... }

    Or, "if(!x) ...":
    SEQ X0, X18, X0
    OP1?T


    *1: This particular pattern is N/E in XG1 or XG3, and N/A in RISC-V.
    In XG1/XG2, there were 2R instructions for some of these cases instead
    (but these were dropped in favor of using X0/Zero to encode the same
    intent, but as noted XG1 and XG2 lack a Zero register).

    In theory, could revive more of XG2's 2R instructions in XG3, but the alternative here being to just leave them as disallowed and use
    zero-register encodings to signal the same intent (in the name of
    simplifying the ISA design).


    Things get maybe more complex for nested branching:
    if(x<y)
    {
    if(x<z)
    ... else ...
    }else
    {
    ...
    }
    But, this is more a compiler/code-structuring issue than an
    actual/significant ISA limitation. And, in most other cases, complex
    nested branches represent cases too bulky to benefit from predication
    (much past a small number of instructions, loses out to the use of
    branches).


    Where using GPRs here achieves the same basic effect without needing to
    add any new encodings or special-case handling to direct comparison
    output into the U bits. Though, ultimately, the U bits ended up used
    more just as a way to optionally detect LR stomping.

    ...


    Nevermind if XG3 still falls slightly behind XG2 for code-density.
    Harder to nail this down exactly, possibly:
    Usage of 8-arg ABI over 16 arg ABI (*);
    Slightly fewer callee save registers (28 vs 31);
    Loss of various 2R instructions and similar;
    ...

    *: Had temporarily moved to a 16-arg ABI, but ended up reverting this
    choice as the number of ABI related issues was non-zero. Did keep a
    register assignment change that went from 24 to 28 callee save registers.


    Code density and performance still beat out my extended variants of
    RISC-V though.

    ...


    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Feb 28 16:41:53 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/22/2026 3:52 PM, John Dallman wrote:
    In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:

    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

    Educators who are any good notice the weird kids who are actually smart.


    Sometimes I question if I really am though.

    Like, some evidence says I am, but by most metrics of "life success" I
    have done rather poorly.


    And, in middle and high-school, they just sorta forced me to sit through normal classes (which sucked really hard). Well, and I apparently missed
    the point of school, thinking it was more of an endurance thing with
    sort of a vague pretense of education (and I probably would have learned more if they just let me spend the time doing whatever else).

    ...



    But, it seems like a case of:
    By implication, I am smart, because if I wasn't, even my own (sometimes pointless) hobby interests would have been out of reach.

    Like, not a world of difficulty justifying them, or debating whether or
    not something is worth doing, but likely not something someone could do
    at all.


    Or, maybe, like encountering things that seem confusing isn't such a
    rare experience (or that people have learned how to deal more
    productively with things they can see but don't understand?...).


    But, there is a thing I have noted:
    I had a few times mentioned to people about finding that certain AIs had gotten smart enough to start understanding how a 5/6 bit finite state machine to predict repeating 1-4 bit patterns would be constructed.

    Then, I try to describe it, and then realize that for the people I try
    to mention it to, it isn't that they have difficulty imagining how one
    would go about filling in the table and getting all of the 4 bit
    patterns to fit into 32 possible states. Many seem to have difficulty understanding how such a finite state machine would operate in the first place.


Even so, this part seems like something that pretty much anyone should be able to understand.

    Initially, I had used this as a test case for the AIs because it posed "moderate difficulty" for problems which could be reasonably completely described in a chat prompt (and is not overly generic).

    Nevermind if it is still a pain to generate tables by hand, and my
    attempts at hand-generated tables have tended to have worse adaptation
    rates than those generated using genetic algorithms (can be more clean looking, but tend to need more input bits to reach the target state if
    the pattern changes).


    Sometimes I feel like a poser.
    Other things, it seems, I had taken for granted.

    Seems sometimes if I were "actually smart", would have figured out some
    way to make better and more efficient use of my span of existence.

    BGB, please don't give up!

    I think it is very obvious to all the regulars here that you are
    obviously very bright, otherwise you would never even have started most
    of the projects you've told us about, not to mention actually making
    them work.

Yes, on a number of occasions I have thought that maybe you were
    attacking the wrong set of problems, but personally I've been very
    impressed, for several years now.

    Just keep on doing what you find interesting!

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Feb 28 16:48:39 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated?  Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and
    bandwidth bound.

    If a problem is bandwidth-bound, then differences between conventional
    architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems.  For example, imagine a company employee file
    (table), with a (say 300 byte) record for each of its many thousands
    of employees, each containing typical employee stuff.  Now suppose
    someone wants to know "What is the total salary of all the employees
    in the "Sales" department?"  With no index on "department", but with
    it at a fixed displacement within each record, the code looks at each
    record, does a trivial test on it, perhaps adds to a register, then
    goes to the next record.  This is almost certainly memory latency
    bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem.  If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.
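    The array-of-pointers case above can be sketched with explicit
    software prefetch. This is a minimal sketch, not anyone's actual code:
    the record layout, field names, and the prefetch distance of 8 are
    made-up illustration values, and __builtin_prefetch is the GCC/Clang
    builtin (a no-op hint on compilers that lack it).

    ```c
    #include <stddef.h>

    /* Hypothetical employee record; only dept/salary matter here. */
    struct rec { char pad[32]; int dept; int salary; char more[260]; };

    /* Sum salaries for one department.  recs[] points at randomly-placed
       records, so we prefetch a few iterations ahead to hide latency. */
    long sum_dept(struct rec **recs, size_t n, int dept)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)                      /* prefetch distance: tunable */
                __builtin_prefetch(&recs[i + 8]->dept, 0, 0);
            if (recs[i]->dept == dept)
                total += recs[i]->salary;
        }
        return total;
    }
    ```

    The prefetches are independent of each other, which is what turns the
    scan into a bandwidth-bound problem rather than a latency-bound one.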


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow, even on big/fancy OoO chips.
    Seemingly about the only real hope the CPU has is to have a large
    cache and just hope that the data happens to be in the cache (and has
    been accessed previously or sufficiently recently), else it is just
    kinda SOL.

    If there is some way that CPUs can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of it
    personally.

    Rather, as can be noted, memory access patterns can often make a
    fairly large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
      PNG decoding typically processes the image as several major phases:
        Decompress the Deflate-compressed buffer into memory;
        Walk over the image, running scanline filters,
          copying scanlines into a new (output) buffer.
    Could you have a secondary thread that started as soon as one (or a
    small number of) scanline(s) were available, taking advantage of any
    shared $L3 cache to grab the data before it is blown away?

    Even if the parts, taken in isolation, should be fast:
      The image buffers are frequently too large to fit in cache;
      Cache misses tend to make PNG decoding painfully slow,
        even when using faster filters.
        If using the Paeth filter though, this adds extra slowness,
          due to branch-predictor misses.
          On targets like x86,
            the filter is frequently implemented using branches;
            The branch miss rate is very high.
            So, a naive branching version, performs like dog crap.
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.

    So, net result: Despite its conceptual simplicity, PNG's decode-time performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It then is faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch mispredict costs.
    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.
    (I might take a look a png decoding at some point)
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sat Feb 28 16:57:00 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 2/22/2026 3:52 PM, John Dallman wrote:
    In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
    Does imply that my younger self was notable, and not seen as just
    some otherwise worthless nerd.

    Educators who are any good notice the weird kids who are actually smart.

    Sometimes I question if I really am though.

    Like, some evidence says I am, but by most metrics of "life success" I
    have done rather poorly.


    And, in middle and high-school, they just sorta forced me to sit through
    normal classes (which sucked really hard)

    In my case, I remember sitting in the back of advanced algebra class
    (mostly senior HS people, me a sophomore) doing chemistry homework
    while vaguely listening to the teacher fail to get various students to
    solve a typical algebra problem. Then she called on me; I looked up at
    the board and in less than a second rattled off the answer, skipping 5
    steps along the way. Moral: don't be bored in class, do something
    useful instead.

    I used a double physics time slot (i.e. two 50-min time slots with a
    5-min break between them) in exactly the same way, except that I
    calculated ~24 digits of pi using the Taylor series for atan(1/5) and
    atan(1/239). The latter part was much faster of course!

    Doing long divisions by 25 and (n^2+n) took the majority of the time.
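    For reference, the same Machin-formula computation in double
    precision. This is just a sketch: the term counts are chosen so the
    truncated series is accurate to double precision, nothing like the
    hand-computed or Univac base-1e10 versions described above.

    ```c
    /* Machin's formula: pi = 16*atan(1/5) - 4*atan(1/239), with
       atan(x) = x - x^3/3 + x^5/5 - ...  (converges for |x| < 1).
       The 1/239 series needs far fewer terms, as noted above. */
    static double atan_series(double x, int terms)
    {
        double sum = 0.0, p = x;
        for (int k = 0; k < terms; k++) {
            sum += (k % 2 ? -p : p) / (2*k + 1);
            p *= x * x;
        }
        return sum;
    }

    double machin_pi(void)
    {
        /* 12 terms of atan(1/5) and 5 of atan(1/239) suffice for doubles */
        return 16.0 * atan_series(1.0/5.0, 12)
             -  4.0 * atan_series(1.0/239.0, 5);
    }
    ```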

    Terje
    PS. I re-implemented the exact same algorithm, using base 1e10, on the
    very first computer I got access to, a Univac 110x in University. This
    was my first ever personal piece of programming.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Sat Feb 28 17:36:47 2026
    From Newsgroup: comp.arch

    Terje Mathisen <[email protected]> writes:
    MitchAlsup wrote:

    In my case, I remember sitting in the back of advanced algebra class
    (mostly senior HS people, me a sophomore) doing chemistry homework while
    vaguely listening to the teacher fail to get various students to solve
    a typical algebra problem. Then she called on me, I looked up at the board
    and in less than a second I rattled off the answer skipping 5 steps along
    the way. Moral, don't be bored in class, do something useful instead.

    I used a double physics time slot (i.e two 50-min time slots with a
    5-min break between them) in exactly the same way, except that I
    calculated ~24 digits of pi using the Taylor series for atan(1/5) and
    atan(1/239). The latter part was much faster of course!

    Coincidentally, I did the same exercise with the taylor
    series, albeit after school when I had access to the
    ASR-33 remotely dialed into either a PDP-8 (TSS/8.24) or
    an HP-3000 (MPE). I might have a listing of the
    PDP-8 basic program around in a box somewhere.



    Doing long divisions by 25 and (n^2+n) took the majority of the time.

    Terje
    PS. I re-implemented the exact same algorithm, using base 1e10, on the
    very first computer I got access to, a Univac 110x in University. This
    was my first ever personal piece of programming.

    My first was a simple BASIC "hello world" program in 1974 on a
    Burroughs B5500 (remotely, via again an ASR-33) which we had
    for a week in 7th grade math class.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Sat Feb 28 10:08:07 2026
    From Newsgroup: comp.arch

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount accessed, etc.) But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes
    longer, thus hurting random workloads. So designers aim to maximize the throughput, subject to cost and technology constraints, for some mix of sequential (bandwidth) versus random (latency) access.



    Concerning a reduction in latency, that would not
    increase performance, because this application is already working at
    the bandwidth limit.

    Again, perhaps terminology. To me, it has not maxed out the bandwidth.
    It has maxed out the throughput for this application. But throughput
    has components of latency and bandwidth. They are different, but both
    are important and it is useful for a designer to think of them separately.



    I feel the urge to write up a mock variant of your use case and
    measure whether reality confirms my expectations, but currently have
    no time for that.

    OK.

    But let's take a slightly simpler case for which I already have a
    program:

    I will spend the time necessary to analyze this in more detail, but my
    impression is that you are using the term "bandwidth" for what I would
    call "throughput". Throughput has two components, bandwidth and
    latency. It is useful to think of them separately.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Sat Feb 28 21:49:35 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.
    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree with your definition, but my prediction is somewhat different.
    First, consider a silly program that goes sequentially over a larger
    array, accessing all lines. AFAICS you should see a tiny effect when
    the program uses only one byte from each line compared to using the
    whole line. Now consider a variant that accesses every fifth line.
    There are differences: one is that the prefetcher needs to realize
    that there is no need to prefetch intermediate lines. The second
    difference is that one can fetch lines quickly only when they are on
    a single page. Having "step 5" on lines means 5 times as many page
    crossings. I do not know how big pages are in modern DRAM, but at a
    large enough step you will see significant delay due to page
    crossings. I would tend to call this delay "latency", but it is
    somewhat murky. Namely, with enough prefetch and enough memory banks
    you can still saturate a single channel to the core (assuming that
    there are many cores, many channels from the memory controller to the
    memory banks, but only a single channel between the memory controller
    and each core). Of course, modern systems tend to have a limited
    number of memory banks, so the argument above is purely theoretical.

    Somewhat different case is when there are independent loads from
    random locations, something like

    for (i = 0; i < N; i++) {
        s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In the case above the reorder buffer and multiple banks
    help, but even with unlimited CPU resources the maximal number of
    accesses is the number of memory banks divided by the access time of
    a single bank (that is essentially the latency of the memory array).

    Then there is pointer chasing case, like

    for (i = 0; i < N; i++) {
        j = m[j];
    }

    When 'm' is filled with a semi-random cyclic pattern this behaves
    quite badly: basically you can start the next access only when you
    have the result of the previous access. In practice, a large 'm'
    seems to produce a large number of cache misses for TLB entries.
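    A minimal sketch of setting up such a semi-random cyclic pattern:
    Sattolo's algorithm generates a single n-cycle, so the chase loop
    really does depend on every previous load.

    ```c
    #include <stdlib.h>
    #include <stddef.h>

    /* Fill m[0..n-1] with one random cycle (Sattolo's algorithm), so
       that j = m[j] visits every slot before returning to the start. */
    void make_cycle(size_t *m, size_t n)
    {
        for (size_t i = 0; i < n; i++) m[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t k = (size_t)rand() % i;   /* 0 <= k < i keeps one cycle */
            size_t t = m[i]; m[i] = m[k]; m[k] = t;
        }
    }

    /* The latency-bound loop from the post: each load depends on the
       result of the previous one, so MLP cannot help. */
    size_t chase(const size_t *m, size_t j, size_t steps)
    {
        while (steps--) j = m[j];
        return j;
    }
    ```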

    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if anything the opposite:
    Row Hammer shows that accesses to "near" pages mean that a given page
    may need more frequent refresh). In the case of memory, time spent in
    the memory controller is non-negligible, at least for accesses within
    a single page. AFAIK random access to lines within a single page
    costs no more than sequential access; for disk you want sequential
    access to a single track.

    Consumer systems usually had only a single head assembly, while
    even low-end systems have multiple prefetch buffers.
    --
    Waldek Hebisch
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 1 05:39:13 2026
    From Newsgroup: comp.arch

    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/24/2026 5:25 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/21/2026 8:18 AM, Anton Ertl wrote:

    big snip

    Otherwise what kind of common code do we have that is
    memory-dominated?  Tree searching and binary search in arrays come to
    mind, but are they really common, apart from programming classes?

    It is probably useful to distinguish between latency bound and
    bandwidth
    bound.

    If a problem is bandwidth-bound, then differences between conventional
    architectures and EPIC play no role, and microarchitectural
    differences in the core play no role, either; they all have to wait
    for memory.

    For latency various forms of prefetching (by hardware or software) can
    help.

    Many occur in commercial (i.e. non scientific) programs, such as
    database systems.  For example, imagine a company employee file
    (table),
    with a (say 300 byte) record for each of its many thousands of
    employees
    each containing typical employee stuff).  Now suppose someone wants to
    know "What is the total salary of all the employees in the "Sales"
    department.  With no index on "department", but it is at a fixed
    displacement within each record, the code looks at each record, does a
    trivial test on it, perhaps adds to a register, then goes to the next
    record.  This is almost certainly memory latency bound.

    If the records are stored sequentially, either because the programming
    language supports that arrangement and the programmer made use of
    that, or because the allocation happened in a way that resulted in
    such an arrangement, stride-based prefetching will prefetch the
    accessed fields and reduce the latency to the one due to bandwidth
    limits.

    If the records are stored randomly, but are pointed to by an array,
    one can prefetch the relevant fields easily, again turning the problem
    into a bandwidth-bound problem.  If, OTOH, the records are stored
    randomly and are in a linked list, this problem is a case of
    pointer-chasing and is indeed latency-bound.

    BTW, thousands of employee records, each with 300 bytes, fit in the L2
    or L3 cache of modern processors.


    FWIW:

    IME, code with fairly random access patterns to memory, and lots of
    cache misses, is inherently slow; even on big/fancy OoO chips.
    Seemingly about the only real hope the CPU has is to have a large
    cache and just hope that the data happens to be in the cache (and has
    been accessed previously or sufficiently recently) else it is just
    kinda SOL.

    If there is some way that CPU's can guess what memory they need in
    advance and fetch it beforehand, I have not seen much evidence of this
    personally.

    Rather, as can be noted, memory access patterns can often make a
    fairly large impact on the performance of some algorithms.


    Like, for example, decoding a PNG like format vs a JPG like format:
       PNG decoding typically processes the image as several major phases:
         Decompress the Deflate-compressed buffer into memory;
         Walk over the image, running scanline filters,
           copying scanlines into a new (output) buffer.

    Could you have a secondary thread that started as soon as one (or a
    small number of) scanline(s) were available, taking advantage of any
    shared $L3 cache to grab the data before it is blown away?


    It is possible to make it piecewise and incremental, but making the
    inflater incremental (and fast) is more complexity and difficulty than
    making a JPEG-like format that is both fast and lossless.

    Though, zlib supports incremental decoding of the type that would be
    useful here, but using zlib this way isn't particularly fast.


    Ironically, if not for cache misses, PNG (and JPEG 2000) would likely be
    some of the faster image codecs around. But, both PNG and JP2K suffer
    from some similar issues here.


    Though, if you stick a Haar or CDF5/3 wavelet in a small fixed-size
    block, it works well (and is faster than if applied over larger
    raster-ordered planes).

    Though, if designing a new codec to optimize for this, could make sense
    to increase the block size from 8x8 to 16x16; with the effective
    macroblock size increased to 32x32.

    Mostly because this is still small enough to probably fit in the L1
    cache on most CPUs (assuming one isn't wasting too much memory on the
    entropy coder or similar).

    Though, likely 32x32 blocks would be too big, and would push the decoder outside of "mostly fits in L1 cache" territory.


    My UPIC format stayed with 8x8 blocks though:
      8x8 blocks, 16x16 macroblocks (still 16x16 for 4:4:4).
        When there are four 8x8 blocks, they are encoded in Hilbert order.
      Used:
        STF+AdRice
        Z3V5 VLNs:
          Encoded similar to Deflate distances,
          zigzag sign-folding for values.
        Block Transforms (as layered 1D transforms):
          BHT: Block Haar.
          CDF 5/3
          WHT (does poorly on average)
          DCT (lossy only)
        Colorspaces:
          RCT
          YCoCg-R
          YCbCr (Approx, Lossy Only)


    At lossless and high quality:
      BHT and CDF5/3 dominate;
      RCT and YCoCg dominate.

    Where, DCT and YCbCr mostly pull ahead at medium to low quality levels
    and for photo-like images.

    For lossy compression of cartoon-like graphics, CDF 5/3 often wins.


    Have noted:
      Is competitive with T.81 JPEG for compression ratio;
      Is slightly faster for decompression;
      For many lossless images,
        tends to beat PNG for both compression and speed.
      Though for many "very artificial" images:
        Such as for UI graphics,
        PNG still wins for compression.


    Compared with PNG, it has a relative weakness in that it isn't as
    effective with repeating patterns and large flat-colored areas. Both
    of these can benefit more from LZ compression. Though, QOI partly
    shares the same issue.



    Even if the parts, taken in isolation, should be fast:
       The image buffers are frequently too large to fit in cache;
       Cache misses tend to make PNG decoding painfully slow,
         even when using faster filters.
         If using the Paeth filter though, this adds extra slowness,
           due to branch-predictor misses.
           On targets like x86,
             the filter is frequently implemented using branches;
             The branch miss rate is very high.
             So, a naive branching version, performs like dog crap.

    This reminds me of CABAC decoding in h264, where the output of the arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise often suffers from cache misses (cycles lost to cache misses can outweigh the
    cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
      10/11: Too short, limits effectiveness.
      12: OK, leans faster;
      13: OK, leans better compression;
      14: Intermediate;
      15: Slower still (Deflate is here);
      16: Slower (T.81 JPEG is here).


    Where, for 12/13 bits, the fastest strategy is typically to use a single
    big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    In this case, the relative slowness of falling back to a linear search
    being less than that of the cost of the L1 misses from a bigger lookup
    table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.
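    The fast-path/slow-path split might look like this toy sketch: a
    hypothetical 4-symbol code (A=0, B=10, C=110, D=111), with the fast
    path matching on the first 2 bits and a linear search over the longer
    codes. These tables are illustrative only, not any real decoder's.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    typedef struct { int sym, len; } Entry;

    /* Fast path: indexed by the top 2 bits of the window. */
    static const Entry fast[4] = {
        {'A',1}, {'A',1}, {'B',2}, {0,0}   /* 11x: not resolvable here */
    };
    /* Slow path: the codes longer than the fast-path index width. */
    static const struct { uint32_t code; int len, sym; } slow[] = {
        {0x6, 3, 'C'},   /* 110 */
        {0x7, 3, 'D'},   /* 111 */
    };

    /* Decode one symbol from 'window' (next bits left-aligned within the
       low 'avail' bits, avail >= 3); sets *len to the bits consumed. */
    int decode_one(uint32_t window, int avail, int *len)
    {
        Entry e = fast[(window >> (avail - 2)) & 3];
        if (e.len) { *len = e.len; return e.sym; }
        for (size_t i = 0; i < sizeof slow / sizeof slow[0]; i++) {
            uint32_t top = (window >> (avail - slow[i].len))
                         & ((1u << slow[i].len) - 1);
            if (top == slow[i].code) { *len = slow[i].len; return slow[i].sym; }
        }
        *len = 0;
        return -1;   /* invalid code */
    }
    ```

    With a real 15- or 16-bit code the slow list holds only the rare long
    symbols, so the linear scan is cheap relative to L1 misses on a full
    64K-entry table.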




    Where, say:
      AdRice:
        + Can be made reasonably fast and cheap.
        + Low memory footprint;
        + Cheap setup cost;
        - Often weaker than Huffman in terms of compression.
        - Pure AdRice only deals with certain distributions
          (Requires STF or similar to mimic Huffman's generality).
      Static Huffman:
        + More optimal in terms of coding efficiency;
        - Speed requires initializing bulky lookup tables;
        - High overhead and ineffective for small payloads.
      Range Coding:
        + Good for maximizing compression;
        + Deals well with small payloads;
        - Slow.


    For small data in some use cases, though, the relative gains of
    entropy coding relative to raw bytes can become small.

    Huffman falls first:
    Constant overheads of the Huffman tables themselves become the dominant
    part of the overhead.

    AdRice falls second:
    At a certain limit, it can become ineffective.

    Range falls third:
    Usually the best, but initial adaptation time becomes the bottleneck
    (needs a minimum number of symbols to actually adapt to anything).


    Had noted recently that AdRice's effectiveness can be improved slightly
    at a fairly modest speed cost by slowing the adaptation speed (say, by adjusting by a fraction of a bit each time).

    Say, typical:
      Q=0: Decrement K (if K>0)
      Q=1: Leave K as-is
      Q>=2: Increment K

    Where K is the number of fixed bits following the unary-coded prefix (Q).

    The tweak being to instead adapt a scaled K, say S, then define K=S>>4
    or similar (so, it effectively needs multiple symbols before the value
    of K updates, but increases how often symbols are encoded at the optimal
    value of K).
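    A sketch of the scaled-K tweak; the step sizes and the scale factor
    of 16 here are illustrative guesses, not the exact constants used.

    ```c
    /* Adaptive Rice parameter with fractional adaptation: S is K scaled
       by 16, so several symbols must agree before K itself moves.
       'q' is the unary prefix length of the symbol just coded. */
    typedef struct { int S; } AdRice;

    int adrice_k(const AdRice *r) { return r->S >> 4; }

    void adrice_update(AdRice *r, int q)
    {
        if (q == 0) {                 /* value small: shrink K (clamped) */
            r->S -= 2;
            if (r->S < 0) r->S = 0;
        } else if (q >= 2) {          /* value large: grow K */
            r->S += q;
        }
        /* q == 1: leave as-is */
    }
    ```

    The effect is a low-pass filter on K: adaptation is slower, but more
    symbols get coded at the value of K that is actually optimal.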


    As for STF (swap towards front):
    The usual strategy is still to swap the symbol with whatever is at
    (15*I/16) or similar.

    There are other schemes, but this one has most often ended up winning
    out IME.
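    The swap rule itself is tiny; a sketch (assuming `table` holds the
    symbol ranking, with index 0 mapping to the shortest code):

    ```c
    /* Swap-towards-front: after coding the symbol at index i, swap it
       with the entry at (15*i)/16, so frequently-seen symbols drift
       toward index 0 (and thus toward shorter codes). */
    void stf_update(unsigned char *table, int i)
    {
        int j = (15 * i) / 16;
        unsigned char t = table[i];
        table[i] = table[j];
        table[j] = t;
    }
    ```

    Unlike full move-to-front, each update is a single swap, so a burst
    of one symbol cannot instantly demote every other symbol's rank.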



    So, net result: Despite its conceptual simplicity, PNG's decode-time
    performance typically sucks.

    Contrast, a decoder for a JPEG like format can be made to process one
    block at a time and go all the way to final output. So, JPEG is often
    faster despite the more complex process (with transform stages and a
    colorspace transform).


    The Paeth filter slowness does seem a little odd though:
    Theoretically, a CPU could turn a short forward branch into predication;
    But, this doesn't tend to be the case.

    It then is faster to turn the filter into some convoluted mess of
    arithmetic and masking in an attempt to reduce the branch mispredict
    costs.

    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look a png decoding at some point)



    #if 0 //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;

        p = a + b - c;
        pa = (p > a) ? (p - a) : (a - p);
        pb = (p > b) ? (p - b) : (b - p);
        pc = (p > c) ? (p - c) : (c - p);

        p = (pa <= pb) ? ((pa <= pc) ? a : c) : ((pb <= pc) ? b : c);
        return p;
    }
    #endif

    #if 1 //avoid branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;
        int ma, mb, mc;
        p = a + b - c;
        pa = p - a;   pb = p - b;   pc = p - c;
        ma = pa >> 31;  mb = pb >> 31;  mc = pc >> 31;  /* sign masks */
        pa = pa ^ ma;   pb = pb ^ mb;   pc = pc ^ mc;   /* approx abs() */
        ma = pb - pa;   mb = pc - pb;   mc = pc - pa;
        ma = ma >> 31;  mb = mb >> 31;  mc = mc >> 31;  /* compare masks */
        p = (ma & ((mb & c) | ((~mb) & b))) | ((~ma) & ((mc & c) | ((~mc) & a)));
        return p;
    }
    #endif

    Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
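    As a sketch of that per-channel form: plain C over 4 channels, using
    the same mask/select pattern as above. A compiler may auto-vectorize
    this, and the pattern maps directly onto SIMD compare+blend
    (e.g. SSE2 pcmpgt plus and/or masking); this is not tuned code.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* Branchless Paeth across 4 channels (e.g. RGBA) at once. */
    void paeth4(const uint8_t a[4], const uint8_t b[4],
                const uint8_t c[4], uint8_t out[4])
    {
        for (int i = 0; i < 4; i++) {
            int p  = a[i] + b[i] - c[i];
            int pa = abs(p - a[i]), pb = abs(p - b[i]), pc = abs(p - c[i]);
            int m1 = -(pa <= pb && pa <= pc);   /* all-ones if 'a' wins */
            int m2 = -(pb <= pc);               /* else pick 'b' over 'c' */
            out[i] = (uint8_t)((m1 & a[i]) |
                               (~m1 & ((m2 & b[i]) | (~m2 & c[i]))));
        }
    }
    ```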


    If someone were designing a new PNG like format, one option could be, say:
      p = (3*a + 3*b - 2*c) >> 2;
      //maybe: clamp p to 0..255 or similar
    Which is "similar, but cheaper".
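    With the clamp written out, the suggested cheaper predictor would be
    something like (a sketch of the one-liner above, nothing more):

    ```c
    /* "Similar, but cheaper" Paeth replacement: a weighted average of
       the left (a), up (b), and up-left (c) neighbors, clamped to the
       0..255 byte range.  Note: >>2 on a negative intermediate is
       implementation-defined in C, hence the clamp below it. */
    int cheap_pred(int a, int b, int c)
    {
        int p = (3*a + 3*b - 2*c) >> 2;
        if (p < 0)   p = 0;
        if (p > 255) p = 255;
        return p;
    }
    ```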


    Otherwise, could be possible to have a faster PNG like format if the
    format were structured to allow doing everything in a single pass (with
    no separate LZ stage).

    If I were designing it, I might also be tempted to use AdRice rather
    than Huffman.


    ...




    Otherwise:

    Meanwhile, I am mostly starting to question if one might ironically need
    to add a restriction clause to MIT-0 to preserve freedom, say, something
    like:
    This code may not be used in jurisdictions where usage would violate the
    terms of the No Warranty clause or where use of the code would be in
    violation of the laws within that jurisdiction.


    Since as-is, the existing language doesn't offer sufficient protection
    from liability in cases where users use code in violation of laws and
    where said laws hold the vendors or copyright holders liable for
    violation of local laws (say, because the code does not actively prevent
    users from using it in a way which is illegal within their laws, and
    which would be applied to parties outside of the jurisdiction in question).

    While MIT-0 allows for re-licensing, this would shift liability to the
    user of the code for using it in a way that violates local laws.

    Should offer protection except in cases where it could be argued that
    the developers had intended for users to use the code in ways which
    violated laws (which would be harder to prove except in cases where it
    could be argued that the sole intended purpose of the code was to be
    used in a way which violated a law; rather than otherwise benign code
    that was used in a way which violated the law).

    Well, at least in theory.


    Terje


    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Mar 1 12:18:10 2026
    From Newsgroup: comp.arch

    Scott Lurndal <[email protected]> schrieb:

    My first was a simple BASIC "hello world" program in 1974 on a
    Burroughs B5500 (remotely, via again an ASR-33) which we had
    for a week in 7th grade math class.

    I started out on my father's first programmable pocket calculator,
    a Casio model with 38 steps (I think).

    I was quite proud when I managed to factorize 123456789, which
    took some time.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Sun Mar 1 07:55:48 2026
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    I was quite proud when I managed to factorize 123456789, which
    took some time.

    Out of curiosity, I just used /usr/bin/factor: 3 3 3607 3803

    Which took 3ms. :)

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Sun Mar 1 19:02:18 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

I don't often use them because I am not aware of a good way to make them fast.



    This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise often suffers from cache misses (cycles lost to cache misses can outweigh the
    cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
      10/11: Too short, limits effectiveness.
      12: OK, leans faster;
      13: OK, leans better compression;
      14: Intermediate
      15: Slower still (Deflate is here)
      16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a single
    big lookup table for the entropy decoder.
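The single-big-table approach can be sketched as follows (hypothetical names, assuming a 12-bit length limit and canonical MSB-first codes, not taken from any particular codec): every table index whose top bits match a code maps to that symbol and its length, so decode becomes one peek, one lookup, one skip.

```c
#include <stdint.h>

#define MAXLEN 12   /* assumed length limit, per the discussion above */

typedef struct { uint16_t sym; uint8_t len; } HuffEnt;

/* Fill a (1<<MAXLEN)-entry table: every index whose top len bits equal a
   symbol's code decodes to that symbol, so decode is a single lookup
   instead of a bit-by-bit tree walk. codes[] are canonical, MSB-first. */
static void BuildTable(HuffEnt *tab, const uint16_t *codes,
                       const uint8_t *lens, int nsym) {
    for (int s = 0; s < nsym; s++) {
        uint32_t base = (uint32_t)codes[s] << (MAXLEN - lens[s]);
        uint32_t fill = 1u << (MAXLEN - lens[s]);
        for (uint32_t i = 0; i < fill; i++) {
            tab[base + i].sym = (uint16_t)s;
            tab[base + i].len = lens[s];
        }
    }
}
```

Per symbol, decode is then roughly `e=tab[PeekBits(12)]; SkipBits(e.len);`, with the 16KB table ideally staying resident in L1.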

For 15 or 16 bits, it is often faster to have a separate fast-path and slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).
    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.
When you have many really short codes (think Morse!), then you can
    profitably decode multiple in a single iteration.

In this case, the relative slowness of falling back to a linear search is less than the cost of the L1 misses from a bigger lookup table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.
    Yeah.
    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look a png decoding at some point)



    #if 0  //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
        int p, pa, pb, pc;

        p=a+b-c;
        pa=(p>a)?(p-a):(a-p);
        pb=(p>b)?(p-b):(b-p);
        pc=(p>c)?(p-c):(c-p);

        p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
        return(p);
    }
    #endif

#if 1    //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    int ma, mb, mc;
    p=a+b-c;
    pa=p-a;         pb=p-b;         pc=p-c;
    ma=pa>>31;      mb=pb>>31;      mc=pc>>31;
    pa=(pa^ma)-ma;  pb=(pb^mb)-mb;  pc=(pc^mc)-mc;  //abs via (x^m)-m
    ma=pb-pa;       mb=pc-pb;       mc=pc-pa;
    ma=ma>>31;      mb=mb>>31;      mc=mc>>31;
    p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
    return(p);
}
#endif

    Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
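A sketch of that with SSE2 intrinsics (assumes an x86 target; `paeth8` and the helper names are invented): pixel bytes are widened to 16-bit lanes so p=a+b-c cannot wrap, and the branchless select is done with compare masks and and/andnot/or blending, mirroring the scalar version above.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* absolute value of signed 16-bit lanes (SSE2 has no _mm_abs_epi16) */
static __m128i abs16(__m128i x) {
    return _mm_max_epi16(x, _mm_sub_epi16(_mm_setzero_si128(), x));
}

/* (mask & a) | (~mask & b) */
static __m128i blend16(__m128i mask, __m128i a, __m128i b) {
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* Paeth predictor on 8 channel values at once; inputs are pixel bytes
   already widened to 16-bit lanes so p=a+b-c cannot wrap. */
static __m128i paeth8(__m128i a, __m128i b, __m128i c) {
    __m128i p  = _mm_sub_epi16(_mm_add_epi16(a, b), c);
    __m128i pa = abs16(_mm_sub_epi16(p, a));
    __m128i pb = abs16(_mm_sub_epi16(p, b));
    __m128i pc = abs16(_mm_sub_epi16(p, c));
    /* not_a: pa>pb or pa>pc, i.e. 'a' loses; not_b: pb>pc, 'b' loses */
    __m128i not_a = _mm_or_si128(_mm_cmpgt_epi16(pa, pb),
                                 _mm_cmpgt_epi16(pa, pc));
    __m128i not_b = _mm_cmpgt_epi16(pb, pc);
    return blend16(not_a, blend16(not_b, c, b), a);
}
```

One call covers eight channel values (two RGBA pixels' worth); the remaining obstacle is the serial dependence on the just-reconstructed left neighbor along a row, so the parallelism is across channels rather than along the row.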
    OK
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sun Mar 1 19:19:45 2026
    From Newsgroup: comp.arch

    On 01/03/2026 13:18, Thomas Koenig wrote:
    Scott Lurndal <[email protected]> schrieb:

    My first was a simple BASIC "hello world" program in 1974 on a
    Burroughs B5500 (remotely, via again an ASR-33) which we had
    for a week in 7th grade math class.

    I started out on my father's first programmable pocket calculator,
    a Casio model with 38 steps (I think).


    Would that have been a Casio fx-3600P ? I bought one of these as a
    teenager, and used it non-stop. 38 steps of program space was not a
    lot, but I remember making a library for complex number calculations for it.

    I was quite proud when I managed to factorize 123456789, which
    took some time.

    I used mine to find formulas for numerical integration (like Simpson's
    rule, but higher order). Basically useless, but fun!

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Mar 1 20:24:04 2026
    From Newsgroup: comp.arch

    David Brown <[email protected]> schrieb:
    On 01/03/2026 13:18, Thomas Koenig wrote:
    Scott Lurndal <[email protected]> schrieb:

    My first was a simple BASIC "hello world" program in 1974 on a
    Burroughs B5500 (remotely, via again an ASR-33) which we had
    for a week in 7th grade math class.

    I started out on my father's first programmable pocket calculator,
    a Casio model with 38 steps (I think).


    Would that have been a Casio fx-3600P ? I bought one of these as a teenager, and used it non-stop. 38 steps of program space was not a
    lot, but I remember making a library for complex number calculations for it.

    Either the fx-180P or the fx-3600P.


    I was quite proud when I managed to factorize 123456789, which
    took some time.

    I used mine to find formulas for numerical integration (like Simpson's
    rule, but higher order). Basically useless, but fun!

    Later, I had a fx-602P, which was a much larger beast. For this,
    I programmed a whole "Kurvendiskussion" (not sure what the English
    term is, it entails finding roots, extrema and inflection points),
    learning about the non-joys of numeric differentiation in the process.
    I deleted this before my final exams, though :-)

    I still have a list of programs I wrote back then, including a
    Moon Lander, although I lost the calculator when studying.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Sun Mar 1 21:12:39 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    That did not solve memory latency, but that's a problem even for OoO
    cores.

I suspect a big part of the problem was tension between Intel and HP,
where the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    The HP side had people like Bob Rau (Cydrome) and Josh Fisher
    (Multiflow), and given their premise, the architecture is ok; somewhat
    on the complex side, but they wanted to cover all the good ideas from
    earlier designs; after all, it was to be the one architecture to rule
    them all (especially performancewise). You cannot leave out a feature
    that a competitor could then add to outperform IA-64.

    The major problem was that the premise was wrong. They assumed that
    in-order would give them a clock rate edge, but that was not the case,
    right from the start (The 1GHz Itanium II (released July 2002)
    competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
    XP (released June 2002)). They also assumed that explicit parallelism
    would provide at least as much ILP as hardware scheduling of OoO CPUs,
    but that was not the case for general-purpose code, and in any case,
they needed a lot of additional ILP to make up for their clock speed disadvantage.

    As I've said before: I worked at HP during IA64, and it was not driven
    by technical issues, but rather political/financial issues.

    On HP's side, IA64 was driven by HP Labs, which was an independent group
    doing technical investigations without any clear line to products. They
    had to "sell" their ideas to the HP development groups, who could ignore them. They managed to get some upper level HP managers interested in IA64,
    and took that directly to Intel. The HP internal development groups (the
    ones making CPUs and server/workstation chipsets) did almost nothing with
    IA64 until after Intel announced the IA64 agreement.

    IA64 was called PrecisionArchitecture-WideWord (PA-WW) by HP Labs as a
    follow on to PA-RISC. The initial version of PA-WW had no register
    interlocks whatsoever, code had to be written to know the L1 and L2
    cache latency, and not touch the result registers too soon. This was
    laughed out of the room, and they came back with interlocks in the next iteration. This happened in 1993-1994, which was before the Out-of-Order
    RISCs came to market (but they were in development in HP and Intel), so the IA64 decisions were being made in the time window before folks really got to see what OoO could do.

    Also on HP's side, we had our own fab, which was having trouble keeping up
    with the rest of the industry. Designers felt performance was not
    predictable, and the fab's costs were escalating. The fab was going to
    have trouble getting to 180nm and beyond. So HP wanted access to Intel's
    fabs, and that was part of the IA64 deal--we could make PA-RISC chips on Intel's fabs for a long time.

    On Intel's side, Intel was divided very strongly geographically. At the
    time, Hillsboro was "winning" in the x86 CPU area, and Santa Clara was
    on the outs (I think they did 860 and other failures like that). So
    when Santa Clara heard of IA64, they jumped on the opportunity--a way to
    leap past Hillsboro. IA64 solved the AMD problem--with all new IA64
patents, AMD couldn't clone it like x86, so management was interested.
Technically, IA64 just had to be "as good as" x86, to make it worthwhile
to jump to a new architecture which removes their competitor. I
    can see how even smart folks could get sucked in to thinking
    "architecture doesn't matter, and this new one prevents clones, so we
    should do it to eventually make more money".

    Both companies had selfish mid-level managers who saw a way to pad their resumes to leap to VP of engineering almost anywhere else. And they were right--on HP's side, I think every manager involved moved to a promotion
    at another company just before Merced came out. So IA64 was not going to
    get canceled--the managers didn't want to admit they were wrong.

    Both companies also saw IA64 as a way to kill off the RISC competitors.
    And on this point, they were right, IA64 did kill the RISC minicomputer
    market.

    The technical merits of IA64 don't make the top 5 in the list of reasons to
    do IA64 for either company.

    But HP using Intel's fabs didn't work out well. HP's first CPU on
    Intel's fabs was the 360MHz PA-8500. This was a disappointing step up
    from the 240MHz PA-8200 (which was partly speed limited by external L1
    cache memory, running at 4ns, and the 8500 moved to on-chip L1 cache
    memory, removing that limit). It turned out Intel's fab advantage was consistency and yield, not speed, and so it would take tuning to get the
    speed up. Intel did this tuning with large teams, and this was not easy
    for HP to replicate. And by this time, IBM was marketing a 180nm copper
    wire SOI process which WAS much faster (and yields weren't a concern for
    HP), so after getting the PA-8500 up to 550MHz after a lot of work, HP
    jumped to IBM as a fab, and the speeds went up to 750MHz and then 875MHz with some light tuning (and a lot less work).

    Everyone technically minded knew IA64 was technically not that great, but
    both companies had their reasons to do it anyway.

    Kent
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Mar 1 21:13:52 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount accessed, etc.)

    It depends on the number of accesses and the ability to absorb the latency.

    For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.

    A page (4096B) at 5GB/s needs 800ns±
    So, we need 12.5 Banks to saturate the data channel
    And we have 16 busses each with 16 banks,
    we can sustain 80GB/s
    So, we need in excess of 200 outstanding requests
    Each able to absorb slightly more than 10µs.

    But this is more like a disk/flash system than main memory.

    Once each bank (of which there are 256) has more than a 3-deep queue
    the BW can be delivered as long as the requests have almost ANY order
    other than targeting 1 (or few) bank(s).

    But what you say is TRUE when one limits the interpretation of modern
    CPUs, but not when one limits themselves to applications running on
    modern CPUs requesting pages from long term storage. {how do you think
Data Bases work??}

    But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes longer, thus hurting random workloads. So

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    designers aim to maximize the throughput, subject to cost and technology constraints, for some mix of sequential (bandwidth) versus random (latency) access.

    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 1 18:05:23 2026
    From Newsgroup: comp.arch

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much its compression advantage.

    Say, max symbol length:
       10/11: Too short, limits effectiveness.
       12: OK, leans faster;
       13: OK, leans better compression;
       14: Intermediate
       15: Slower still (Deflate is here)
       16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    Probably more true of a Deflate style table though, where sending the
    table as an array of symbol lengths does limit the configurations it can
    take.


When you have many really short codes (think Morse!), then you can profitably decode multiple in a single iteration.


    Typical in my case of both shorter length-limited Huffman, and of Rice
    coding.


    In the case of Rice, it can often make sense to use a lookup table to
    decode the Q prefix.

    Say (pseudocode):
  b=PeekBits()
  q=qtab[b&255];
  if(q>=8)
    { skb=16; v=(b>>8)&255; }
  else
    { skb=q+k+1; v=(q<<k)|((b>>(q+1))&((1<<k)-1)); }
  SkipBits(skb);
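Fleshing that out into a self-contained C sketch (illustrative names; note this variant assumes an MSB-first bitstream and only handles q+1+k up to a 24-bit peek): qtab maps the next 8 bits to the number of leading 1 bits, so the unary quotient, its stop bit, and the k remainder bits are usually consumed after a single table hit.

```c
#include <stdint.h>
#include <stddef.h>

static const uint8_t *bs_buf;   /* bitstream buffer, MSB-first */
static size_t bs_pos;           /* current bit position */

static uint32_t PeekBits(int n) {   /* peek up to 25 bits */
    size_t i = bs_pos >> 3;
    uint32_t w = ((uint32_t)bs_buf[i]   << 24) |
                 ((uint32_t)bs_buf[i+1] << 16) |
                 ((uint32_t)bs_buf[i+2] <<  8) |
                  (uint32_t)bs_buf[i+3];
    return (w << (bs_pos & 7)) >> (32 - n);
}
static void SkipBits(int n) { bs_pos += n; }

/* qtab[b] = number of leading 1 bits in byte b (capped at 8) */
static uint8_t qtab[256];
static void InitQtab(void) {
    for (int b = 0; b < 256; b++) {
        int q = 0;
        while (q < 8 && (b & (0x80 >> q))) q++;
        qtab[b] = (uint8_t)q;
    }
}

/* decode one Rice-coded value with parameter k (assumes q+1+k <= 24) */
static uint32_t RiceDecode(int k) {
    uint32_t b = PeekBits(24);
    int q = qtab[(b >> 16) & 255];
    if (q >= 8) {               /* escape: next 8 bits hold the value raw */
        uint32_t v = (b >> 8) & 255;
        SkipBits(16);
        return v;
    }
    uint32_t r = (b >> (24 - (q + 1) - k)) & ((1u << k) - 1);
    SkipBits(q + 1 + k);
    return ((uint32_t)q << k) | r;
}
```

With k=2, the byte 0xA2 holds the values 6 (bits 1,0,10) and then 1 (bits 0,01).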




    In this case, the relative slowness of falling back to a linear search
    being less than that of the cost of the L1 misses from a bigger lookup
    table.

    The relative loss of Huffman coding efficiency between a 13 bit limit
    and 15 bit limit is fairly modest.

    Yeah.


    Some of my designs where I stuck with Huffman had ended up going over to
    a 13 bit limit partly for this reason, as in this case, the lookup table fitting in the L1 cache ends up as a win.


    Though, a factor is the number of tables in use; something more like
    JPEG where one has 4 of them (Y-DC, Y-AC, UV-DC, UV-AC); this still
    isn't going to fit in the L1 cache.


    So, errm, partial win here for STF+AdRice.

    Though, one could argue:
    Why use Rice for encoding the coefficient values as VLNs vs just
    encoding the values directly as Rice? Well, simple answer is mostly that
    this sucks.


    As noted, one might want to encode symbols that encode both a zero run
    and a value. JPEG had used Z4V4, with the value directly encoding the
    number of bits. The scheme used by JPEG was comparably space-inefficient though; and the general scheme used by Deflate was more space-efficient.

So:
JPEG scheme:
  Read V bits;
  Sign extend these bits for the full coefficient.
    val=ReadBits(v);
    sh=32-v;
    val=((s32)(val<<sh))>>sh;
Deflate Inspired:
  if(v>=4)
    { h=(v>>1)-1; val=((2|(v&1))<<h)|ReadBits(h); }
  else
    { val=v; }
  val=(val>>1)^(((s32)(val<<31))>>31);

where, value table (unsigned) looks sorta like:
  pfx   extra   value
  0/1     0      0 /  1
  2/3     0      2 /  3
  4/5     1      4..  7
  6/7     2      8.. 15
  8/9     3     16.. 31
  ...
    With the LSB then encoding the sign, so:
    0, -1, 1, -2, 2, ...
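The Deflate-inspired prefix scheme plus the sign folding round-trips as below (a sketch; `VlnEnc`/`VlnDec`/`ZigzagEnc`/`ZigzagDec` are invented names, and the prefix symbol, extra-bit count, and extra bits are kept separate here rather than being entropy-coded):

```c
#include <stdint.h>

/* zigzag map: 0, -1, 1, -2, 2, ... <-> 0, 1, 2, 3, 4, ... */
static uint32_t ZigzagEnc(int32_t s) {
    return ((uint32_t)s << 1) ^ (uint32_t)(s >> 31);
}
static int32_t ZigzagDec(uint32_t u) {
    return (int32_t)(u >> 1) ^ -(int32_t)(u & 1);
}

/* split an unsigned value into (prefix symbol v, extra-bit count h, extras) */
static void VlnEnc(uint32_t u, uint32_t *v, int *h, uint32_t *extra) {
    if (u < 4) { *v = u; *h = 0; *extra = 0; return; }
    int hi = 31;
    while (!((u >> hi) & 1)) hi--;          /* position of top set bit */
    *h = hi - 1;                            /* extra bits below top two */
    *v = (uint32_t)((hi << 1) | ((u >> *h) & 1));
    *extra = u & ((1u << *h) - 1);
}

/* rebuild the unsigned value from (v, extra), as in the decoder above */
static uint32_t VlnDec(uint32_t v, uint32_t extra) {
    if (v < 4) return v;
    int h = (int)(v >> 1) - 1;
    return ((2u | (v & 1)) << h) | extra;
}
```

So, for example, 6 encodes as prefix 5 with one extra bit of 0, matching the 4/5 row of the table above.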


    Though, as can be noted, the loss of one bit for Z means that the
    maximum run of zeroes per symbol is shorter.

    The main difference is mostly that the JPEG scheme costs roughly 1 bit
    more per VLN (typical), or 2 bits more for small values (-1 and 1 need 2
    extra with the JPEG scheme).


    Though, despite the more efficient VLN scheme, UPIC does lose some
    entropic efficiency with its use of STF+AdRice here rather than Huffman.


    I would look for a way to handle multiple pixels at once, with SIMD
    code: There the masking/combining is typically the easiest way to
    implement short branches.

    (I might take a look a png decoding at some point)



    #if 0  //naive version, pays a lot for branch penalties
    int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
    {
         int p, pa, pb, pc;

         p=a+b-c;
         pa=(p>a)?(p-a):(a-p);
         pb=(p>b)?(p-b):(b-p);
         pc=(p>c)?(p-c):(c-p);

         p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
         return(p);
    }
    #endif

#if 1    //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
    int p, pa, pb, pc;
    int ma, mb, mc;
    p=a+b-c;
    pa=p-a;         pb=p-b;         pc=p-c;
    ma=pa>>31;      mb=pb>>31;      mc=pc>>31;
    pa=(pa^ma)-ma;  pb=(pb^mb)-mb;  pc=(pc^mc)-mc;  //abs via (x^m)-m
    ma=pb-pa;       mb=pc-pb;       mc=pc-pa;
    ma=ma>>31;      mb=mb>>31;      mc=mc>>31;
    p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
    return(p);
}
#endif

    Where, the Paeth filter is typically the most heavily used filter in
    PNG decoding (because it tends to be the most accurate), but also the
    slowest.

    Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.

    OK



    As an idle thought for a format "sorta PNG-like", but meant to be faster
    to decode (though, aiming to be closer to PNG than, say, QOI).


    Tag symbols encode commands, say:
    01..0F: 1..15 pixels, Delta per Pixel, useoffset=0;
    11..1F: 1..15 pixels, Delta per pixel, useOffset=1;
    21..2F: 1..15 pixels, Delta is 0, useOffset=0;
31..3F: 1..15 pixels, Delta is 0, useOffset=1;
    41..4F: 1..15 pixels, Single Delta, Applied Every Pixel
    51..5F: 1..15 pixels, Single Delta, Applied One Pixel

    00/10/20/30/40/50: 16+ pixels
    Length follows, encoded as a VLN.
    Otherwise behaves the same as the corresponding 1-15 case.

    60..6F: -
    70..7F: -
    80..8F: -
    90..9F: -
    A0..AF: Single Delta of a given Type, useOffset=0;
    B0..BF: Single Delta of a given Type, useOffset=1;
    C0..CF: Set Predictor Function, useOffset=0;
    D0..DF: -
    E0..FF: Set Predictor Offset, useOffset=1;

    So, in this case:
    useOffset==0:
    Predict pixel values in a similar way to PNG.
    Adjacent pixels are used with predictor.
    useOffset==1:
The Offset points at a previous location in the image
    Deltas relative to the pixels at this offset.
    Offset is in raster space, must point within image.
    Encoded in a similar way to a Deflate distance.
    Always relative to the current position in raster space.
    The offset pixel is used as the prediction.

    So, setting an offset and then doing a run of delta==0 pixels
    effectively just copies the prior pixels. Otherwise, deltas are applied relative to those pixels.


    As in PNG, the deltas would likely be mod-256.
    Each delta point would be encoded as a symbol.


    Predictor functions:
    0: Special, restore last predictor
    1: Last Value
    2: Left
    3: Up
    4: Average of Left and Up
    6: (3*A+3*B-2*C)/4
    7: Paeth (Possible, but slower option)

    Delta Types:
    0: Special, last delta type
    1: dY, dR=dG=dB=dY, dA=0
    2: dRGB, dA=0
    3: dRGBA
    4: dYUV, dA=0 (encodes dRGB as RCT)
    5: dYUVA


    Would likely have STF+AdRice contexts for:
    Tag/Command Bytes
    Delta Bytes
    Length VLN prefix values.


    Downsides:
This design as-is is not particularly elegant;
    Would exist awkwardly in a part of the design space between PNG and QOI;
    Would have a comparably more complex encoder.

    Would be considered as failing if:
    Compression is significantly worse than PNG;
    Fails to beat PNG at decode speed.

    The more complex stream representation is partly to compensate for not
    having an LZ stage.

    ...


    Would likely make sense to have the various configurations as function pointers, but this is not ideal (both due to code bulk and
    function-pointer overheads). But, function pointers are likely to be
    faster than decision trees in this case.


    Then again, might just be better to figure out some effective way to
    have an incremental stream-decoded inflater...


    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Mar 2 02:03:10 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to
    make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much its compression advantage.

    Say, max symbol length:
       10/11: Too short, limits effectiveness.
       12: OK, leans faster;
       13: OK, leans better compression;
       14: Intermediate
       15: Slower still (Deflate is here)
       16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast-path and
    slow path. Say, fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).

    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    In HW any pattern can be used. In SW only patterns that are almost
    satisfied by the current ISA can be considered. Big difference.
    --- Synchronet 3.21c-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Mon Mar 2 17:12:53 2026
    From Newsgroup: comp.arch

    On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
    Stephen Fuld <[email protected]d> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited. Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance. To me, that is the definition
    of being latency bound, not bandwidth bound.

    I agree with your definition, but my prediction is somewhat different.
First, consider a silly program that goes sequentially over a larger
array, accessing all lines. AFAICS you should see a tiny effect when
the program uses only one byte from each line compared to using the
whole line. Now consider a variant that accesses every fifth line.
There are differences: one is that the prefetcher needs to realize
that there is no need to prefetch intermediate lines. A second
difference is that one can fetch lines quickly only when they are on
a single page. Having "step 5" on lines means 5 times as many page
crossings. I do not know how big pages are in modern DRAM, but at a
large enough step you will see significant delay due to page
crossings. I would tend to call this delay "latency", but it is
somewhat murky. Namely, with enough prefetch and enough memory banks
you can still saturate a single channel to the core (assuming that
there are many cores, many channels from the memory controller to
memory banks, but only a single channel between the memory controller
and each core). Of course, modern systems tend to have a limited
number of memory banks, so the argument above is purely theoretical.

    Somewhat different case is when there are independent loads from
    random locations, something like

    for(i = 0; i < N; i++) {
    s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In the case above the reorder buffer and multiple banks
    help, but even with unlimited CPU resources the maximal rate of
    accesses is the number of memory banks divided by the access time of
    a single bank (which is essentially the latency of the memory array).

    Then there is pointer chasing case, like

    for(i = 0; i < N; i++) {
    j = m[j];
    }

    when 'm' is filled with a semi-random cyclic pattern this behaves
    quite badly: basically you can start the next access only when you
    have the result of the previous access. In practice, a large 'm'
    seems to produce a large number of cache misses for TLB entries too.

    Well, the examples I gave got confusing (my fault) because, as Anton
    pointed out, the table I used would fit into L3 cache on many modern
    systems. So this all got tied up in the difference between in cache
    results and in DRAM results. I don't disagree with your points, but
    they are tangential to the point I was trying to make.

    Let me try again. Suppose you had a (totally silly) program with a
    2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat, say, 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound. If
    that is true, then that explains the original differences Anton and
    I had.


    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a
    track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if anything the opposite:
    Row Hammer shows that accesses to "near" pages mean that a given page
    may need more frequent refresh). In the case of memory, time spent in
    the memory controller is non-negligible, at least for accesses within
    a single page. AFAIK random access to lines within a single page
    costs no more than sequential access; for disk you want sequential
    access to a single track.

    Yes, but these are second order compared to the difference between
    latency and bandwidth on a disk.

    This whole thing has spiraled into something far beyond what I
    expected, and what, I think, is useful. So unless you want me to, I
    probably won't respond further.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Mar 3 02:34:28 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
    Stephen Fuld <[email protected]d> wrote:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    ----------------------
    Somewhat different case is when there are independent loads from
    random locations, something like

    for(i = 0; i < N; i++) {
    s += m[f(i)];
    }

    where 'f' is very cheap to compute, but hard to predict by the
    hardware. In the case above the reorder buffer and multiple banks
    help, but even with unlimited CPU resources the maximal rate of
    accesses is the number of memory banks divided by the access time of
    a single bank (which is essentially the latency of the memory array).

    Then there is pointer chasing case, like

    for(i = 0; i < N; i++) {
    j = m[j];
    }

    when 'm' is filled with a semi-random cyclic pattern this behaves
    quite badly: basically you can start the next access only when you
    have the result of the previous access. In practice, a large 'm'
    seems to produce a large number of cache misses for TLB entries too.

    Well, the examples I gave got confusing (my fault) because, as Anton
    pointed out, the table I used would fit into L3 cache on many modern systems. So this all got tied up in the difference between in cache
    results and in DRAM results. I don't disagree with your points, but
    they are tangential to the point I was trying to make.

    Let me try again. Suppose you had a (totally silly) program with a 2 GB array, and you used a random number to generate an address within it, then added the value at that addressed byte to an accumulator. Repeat, say,
    10,000 times. I would call this program latency bound, but I suspect
    Anton would call it bandwidth bound. If that is true, then that
    explains the original differences Anton and I had.

    This can be limited by the calculation latency of the RNG. A 1-cycle
    RNG would allow for as many memory references as the CPU allows to be
    in progress simultaneously. A K/cycle RNG would simply saturate the
    miss buffer sooner.

    Your typical 'Multiply and add with table indirection' RNG is about the
    latency of L2-hit or a bit longer; and way smaller than L3-hit.

    When the RNG latency is longer than the L3 latency--you will not be memory bound.
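    A hypothetical RNG of the "multiply and add with table indirection"
    kind described above might look like the sketch below (constants and
    names are illustrative, not from any particular library). The
    dependent table load puts a cache access on the RNG's critical path,
    which is why its latency lands near an L1/L2 hit plus a multiply-add:

```c
#include <stdint.h>

static uint64_t tab[256];   /* assumed pre-filled with random values */
static uint64_t state = 1;

/* multiply-add (LCG step) followed by a dependent table indirection */
static uint64_t rng_next(void)
{
    state = state * 6364136223846793005ull + 1442695040888963407ull;
    return state ^ tab[state >> 56];     /* the load is on the critical path */
}
```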


    Perhaps this distinction is clearer to me due to my background in the
    (hard) disk business. You want lower latency? Make the arm move faster
    or spin the disk faster. You want higher bandwidth? Put more bits on a
    track or interleave the data across multiple disk heads. And in a
    system, the number of active prefetches is naturally limited by the
    number of disk arms you have.

    That disk analogy is flawed. AFAIK there is no penalty for choosing
    "far away" pages compared to "near" ones (if anything the opposite:
    Row Hammer shows that accesses to "near" pages mean that a given page
    may need more frequent refresh). In the case of memory, time spent in
    the memory controller is non-negligible, at least for accesses within
    a single page. AFAIK random access to lines within a single page
    costs no more than sequential access; for disk you want sequential
    access to a single track.

    Yes, but these are second order compared to the difference between
    latency and bandwidth on a disk.

    Or a RAID of disks--but that is only of importance when one has enough
    disks for the "several ms" to be amortized by the "data occupancy" on
    the bus. {Little's Law} And in practical systems one needs on the order
    of several hundreds of disks for that to be manifest.

    This whole thing has spiraled into something far beyond what I
    expected, and what, I think, is useful. So unless you want me to, I
    probably won't respond further.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Mar 3 04:24:14 2026
    From Newsgroup: comp.arch

    On 3/1/2026 8:03 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    On 3/1/2026 12:02 PM, Terje Mathisen wrote:
    BGB wrote:
    On 2/28/2026 9:48 AM, Terje Mathisen wrote:
    This reminds me of CABAC decoding in h264, where the output of the
    arithmetic decoder is single bits that by definition cannot be
    predictable, but the codec typically uses that bit to branch.


    Yeah.

    Making arithmetic and range coders fast is also hard.

    I don't often use them as much because I am not aware of a good way to make them fast.



    This is part of why I had often ended up going for STF+AdRice or
    similar, which, while not the best in terms of compression, can be one
    of the faster options in many cases.

    Theoretically, table-driven Huffman could be faster, but likewise
    often suffers from cache misses (cycles lost to cache misses can
    outweigh the cost of the more complex logic of an AdRice coder).

    Huffman speed can be improved by reducing maximum symbol length and
    table size, but then this can lose much of its compression advantage.

    Say, max symbol length:
       10/11: Too short, limits effectiveness.
       12: OK, leans faster;
       13: OK, leans better compression;
       14: Intermediate
       15: Slower still (Deflate is here)
       16: Slower (T.81 JPEG is here)


    Where, for 12/13 bits, the fastest strategy is typically to use a
    single big lookup table for the entropy decoder.

    For 15 or 16 bits, it is often faster to have a separate fast path and
    slow path. Say, the fast path matches on the first 9 or 10 bits, and then
    the slow path falls back to a linear search (over the longer symbols).
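    The single-big-table variant described above can be sketched as
    follows (assuming MSB-first bit order and a prebuilt table of
    (symbol, length) entries, one per possible MAXLEN-bit prefix; names
    and the refill convention are illustrative):

```c
#include <stdint.h>

#define MAXLEN 12   /* maximum code length in bits */

struct hentry { uint16_t sym; uint8_t len; };
static struct hentry htab[1 << MAXLEN];   /* one entry per 12-bit prefix */

/* Decode one symbol from a bit buffer holding at least MAXLEN valid bits
   at its top (MSB first). Peek MAXLEN bits, look up (symbol, length),
   then consume only 'length' bits. Refill of *bits is the caller's job. */
static unsigned decode_one(uint64_t *bits, int *nbits)
{
    unsigned idx = (unsigned)(*bits >> (64 - MAXLEN));  /* peek MAXLEN bits */
    struct hentry e = htab[idx];
    *bits <<= e.len;                                    /* consume e.len bits */
    *nbits -= e.len;
    return e.sym;
}
```

    Every prefix that starts with a given short code maps to the same
    (symbol, length) pair, which is what makes the single lookup work.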

    I have looked at multi-level table lookups, where the symbol either is
    the one you want (short codes) or an index into a list of secondary
    tables to be used on the remaining bits.


    Can work OK if one assumes all of the longer codes are prefixed by a
    longish series of 1s, which is usually but not necessarily true.

    In HW any pattern can be used. In SW only patterns that are almost
    satisfied by the current ISA can be considered. Big difference.

    I guess it is possible someone could define hardware logic to support
    Huffman coding, but then again, it would be even easier to define
    hardware support for Rice coding.

    Though, this could range from more generic, like a CTNZ instruction
    (Count Trailing Non-Zero) to maybe more specialized instructions.

    Big downside for Huffman in HW is that almost invariably it would
    require big lookup tables, whereas Rice coding could mostly be done with
    fixed logic (and/or more generic instructions).


    Like, often the main lookup table used for Rice-decoding is just to do
    the equivalent of a CTNZ operation.

    Could integrate things more, but this would likely get into the
    territory of needing instructions with multiple destination registers or similar (and/or some sort of state-containing architectural feature).

    ...



    Meanwhile, did a quick mock-up of the Rice-coded vaguely PNG-like format mentioned previously (calling it UNG for now), but thus far it is
    kinda looking like a turd.


    Comparing a few formats (with a photo-like image, 1024x688, lossless or
    max quality; speeds on my desktop PC):
    UPIC: 488K, 41.5 Mpix/s
    UNG : 1060K, 21.6 Mpix/s
    JPG : 410K, 25.3 Mpix/s (Q=100, inexact)
    PNG : 795K, 12.7 Mpix/s
    QOI : 1036K, 76.2 Mpix/s

    So, UPIC gives nearly JPEG-like compression while being lossless and
    faster than either JPEG or PNG.


    QOI wins for speed, but not compression (it is a byte-oriented format).
    It is fast, but in my own testing its compression still often loses to
    PNG (despite claims that it beats PNG).


    And, my UNG test was kind of a fail thus far, having a QOI-like
    file-size with closer to PNG-like speeds.

    Well, and the use of entropy coding is not necessarily a win though, if
    the design still turns out to be kinda a turd...



    UNG could maybe be improved with more fiddling, but thus far this is not
    a good start.


    It looks like UPIC is still in the lead here. Initially, it was mostly
    just focused on speed (and ability to support a lossless mode) but
    turned out to also be pretty solid for compression as well.

    Also maybe ironic that the Block Haar Transform is both faster than DCT
    and also still fairly effective as a block transform (and lossless). Not
    sure why Block-Haar is not more popular.
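    For reference, the core of an integer Haar step (the S-transform, one
    common lossless formulation; a sketch, not necessarily the exact
    transform UPIC uses). A pair (a,b) maps to a floor-average and a
    difference, exactly invertible in integers, which is what makes a
    Block Haar Transform lossless:

```c
#include <stdint.h>

/* forward: (a,b) -> (floor average s, difference d) */
static void haar_fwd(int32_t *a, int32_t *b)
{
    int32_t d = *a - *b;
    int32_t s = *b + (d >> 1);   /* arithmetic shift gives the floor */
    *a = s; *b = d;
}

/* inverse: reconstructs (a,b) exactly from (s,d) */
static void haar_inv(int32_t *a, int32_t *b)
{
    int32_t d = *b;
    int32_t y = *a - (d >> 1);   /* recovers the original b */
    *a = y + d; *b = y;
}
```

    Only adds, subtracts, and shifts; no multiplies, which is much of the
    speed advantage over a DCT.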

    Note that both UPIC and JPG see a pretty big speed boost when using lossy compression here (and, among the formats tested, JPG is the closest
    direct analog to UPIC among the mainline formats).


    Then again, maybe I need to test with highly-compressible synthetic
    images (like UI graphics), as this is typically where PNG holds a strong
    lead. And, I didn't really come up with the UNG design with photos in
    mind (even if they make sense as an initial test case).

    Would likely be a worse option for texture-maps than UPIC.
    Likely has no real advantage over indexed-color BMP for the use-cases
    where indexed-color BMP makes sense (and thus far the best compression
    method for indexed-color graphics seems to be to use LZ compression over
    the indexed color graphics).



    Still kinda annoying sometimes that it seems like stuff like this can be
    a bit hit or miss, one doesn't always know in advance whether something
    will work well or turn out to just kinda suck (well, and sometimes
    things that seem to suck initially can pull ahead with some more polishing).

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Tue Mar 3 11:22:40 2026
    From Newsgroup: comp.arch

    Thanks for this account of what happened with IA64.
    It's the first time I see an explanation that really makes sense.


    === Stefan


    Kent Dickey [2026-03-01 21:12:39] wrote:

    In article <[email protected]>,
    Anton Ertl <[email protected]> wrote:
    Stefan Monnier <[email protected]> writes:
    At the time of conception, there were many arguments that {sooner or
    later} compilers COULD figure stuff like this out.

    I can't remember seeing such arguments coming from compiler people, tho.

    Actually, the IA-64 people could point to the work on VLIW (in
    particular, Multiflow (trace scheduling) and Cydrome (software
    pipelining)), which in turn is based on the work on compilers for
    microcode.

    That did not solve memory latency, but that's a problem even for OoO
    cores.

    I suspect a big part of the problem was tension between Intel and HP
    where the only political solution was allowing the architects from both
    sides to "dump in" their favorite ideas. A recipe for disaster.

    The HP side had people like Bob Rau (Cydrome) and Josh Fisher
    (Multiflow), and given their premise, the architecture is ok; somewhat
    on the complex side, but they wanted to cover all the good ideas from
    earlier designs; after all, it was to be the one architecture to rule
    them all (especially performancewise). You cannot leave out a feature
    that a competitor could then add to outperform IA-64.

    The major problem was that the premise was wrong. They assumed that
    in-order would give them a clock rate edge, but that was not the case,
    right from the start (the 1GHz Itanium II (released July 2002)
    competed with the 2.53GHz Pentium 4 (released May 2002) and 1800MHz
    Athlon XP (released June 2002)). They also assumed that explicit
    parallelism would provide at least as much ILP as hardware scheduling
    of OoO CPUs, but that was not the case for general-purpose code, and
    in any case, they needed a lot of additional ILP to make up for their
    clock speed disadvantage.

    As I've said before: I worked at HP during IA64, and it was not driven
    by technical issues, but rather political/financial issues.

    On HP's side, IA64 was driven by HP Labs, which was an independent group doing technical investigations without any clear line to products. They
    had to "sell" their ideas to the HP development groups, who could ignore them.
    They managed to get some upper level HP managers interested in IA64,
    and took that directly to Intel. The HP internal development groups (the ones making CPUs and server/workstation chipsets) did almost nothing with IA64 until after Intel announced the IA64 agreement.

    IA64 was called PrecisionArchitecture-WideWord (PA-WW) by HP Labs as a
    follow on to PA-RISC. The initial version of PA-WW had no register interlocks whatsoever, code had to be written to know the L1 and L2
    cache latency, and not touch the result registers too soon. This was
    laughed out of the room, and they came back with interlocks in the next iteration. This happened in 1993-1994, which was before the Out-of-Order RISCs came to market (but they were in development in HP and Intel), so the IA64 decisions were being made in the time window before folks really got to see what OoO could do.

    Also on HP's side, we had our own fab, which was having trouble keeping up with the rest of the industry. Designers felt performance was not predictable, and the fab's costs were escalating. The fab was going to
    have trouble getting to 180nm and beyond. So HP wanted access to Intel's fabs, and that was part of the IA64 deal--we could make PA-RISC chips on Intel's fabs for a long time.

    On Intel's side, Intel was divided very strongly geographically. At the time, Hillsboro was "winning" in the x86 CPU area, and Santa Clara was
    on the outs (I think they did 860 and other failures like that). So
    when Santa Clara heard of IA64, they jumped on the opportunity--a way to
    leap past Hillsboro. IA64 solved the AMD problem--with all new IA64
    patents, AMD couldn't clone it like x86, so management was interested. Technically, IA64 just had to be "as good as" x86, to make it worth
    while to jump to a new architecture which removes their competitor. I
    can see how even smart folks could get sucked in to thinking
    "architecture doesn't matter, and this new one prevents clones, so we
    should do it to eventually make more money".

    Both companies had selfish mid-level managers who saw a way to pad their resumes to leap to VP of engineering almost anywhere else. And they were right--on HP's side, I think every manager involved moved to a promotion
    at another company just before Merced came out. So IA64 was not going to
    get canceled--the managers didn't want to admit they were wrong.

    Both companies also saw IA64 as a way to kill off the RISC competitors.
    And on this point, they were right, IA64 did kill the RISC minicomputer market.

    The technical merits of IA64 don't make the top 5 in the list of reasons to do IA64 for either company.

    But HP using Intel's fabs didn't work out well. HP's first CPU on
    Intel's fabs was the 360MHz PA-8500. This was a disappointing step up
    from the 240MHz PA-8200 (which was partly speed limited by external L1
    cache memory, running at 4ns, and the 8500 moved to on-chip L1 cache
    memory, removing that limit). It turned out Intel's fab advantage was consistency and yield, not speed, and so it would take tuning to get the speed up. Intel did this tuning with large teams, and this was not easy
    for HP to replicate. And by this time, IBM was marketing a 180nm copper
    wire SOI process which WAS much faster (and yields weren't a concern for
    HP), so after getting the PA-8500 up to 550MHz after a lot of work, HP
    jumped to IBM as a fab, and the speeds went up to 750MHz and then 875MHz with some light tuning (and a lot less work).

    Everyone technically minded knew IA64 was technically not that great, but both companies had their reasons to do it anyway.

    Kent
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Tue Mar 3 09:15:14 2026
    From Newsgroup: comp.arch

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to
    increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount
    accessed, etc.)

    It depends on the number of accesses and the ability to absorb the latency.

    For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.

    A page (4096B) at 5GB/s needs 800ns±
    So, we need 12.5 Banks to saturate the data channel
    And we have 16 busses each with 16 banks,
    we can sustain 80GB/s
    So, we need in excess of 200 outstanding requests
    Each able to absorb slightly more than 10µs.

    But this is more like a disk/flash system than main memory.

    Yes. In particular, requests to main memory are typically the size of a
    cache block, not a page. That changes the calculations above.


    Once each bank (of which there are 256) has more than a 3-deep queue
    the BW can be delivered as long as the requests have almost ANY order
    other than targeting 1 (or few) bank(s).

    So you are saying that the system is bandwidth limited as long as the
    CPU can sustain 768 (3 * 256) simultaneous prefetches in progress. OK :-)


    But what you say is TRUE when one limits the interpretation to modern
    CPUs, but not when one limits themselves to applications running on
    modern CPUs requesting pages from long term storage. {how do you think
    Data Bases work??}

    OK. BTW, I have done database "stuff" since CODASYL database systems in
    the 1970s through relational systems in the 1980s. But page sized
    accesses to external storage wasn't what we were talking about.



    But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes
    longer, thus hurting random workloads. So

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk controller, making it the industry's first true cache disk controller.
    This was almost all about reducing latency (from tens of milliseconds on
    a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Tue Mar 3 17:37:23 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    On 3/1/2026 1:13 PM, MitchAlsup wrote:


    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk >controller, making it the industry's first true cache disk controller.

    Let me guess: Your employer was purchased about five years
    later by a former Secretary of the Treasury :-).

    FWIW, about that same time, there were third-party
    RAM-based disk units available for the systems
    that many of the big-B customers were using. Not inexpensive,
    but they performed well (if still limited by disk controller
    and host I/O bus bandwidth, 1MB/s and 8MB/s respectively
    on the medium systems line).

    The big-B medium systems also could reserve part of main
    memory and treat it as a RAMdisk.

    This was almost all about reducing latency (from tens of milliseconds on
    a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Tue Mar 3 09:53:21 2026
    From Newsgroup: comp.arch

    On 3/3/2026 9:37 AM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 3/1/2026 1:13 PM, MitchAlsup wrote:


    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    Perhaps that it true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
    controller, making it the industry's first true cache disk controller.

    Let me guess: Your employer was purchased about five years
    later by a former Secretary of the Treasury :-).

    No! Univac/Sperry/Unisys was our competitor. :-)


    FWIW, about that same time, there were third-party
    RAM-based disk units available for the systems
    that many of the big-B customers were using. Not inexpensive,
    but performed well (if still limited by disk controller
    and host I/O bus bandwidth (1MB/s and 8MB/s respectively
    on the medium systems line).

    I know. My employer made SSDs (PCM for Univac FH 432 and 1792 drum
    systems) starting in the mid 1970s. The same memory cards were used as
    the cache in the cache disk controller.

    BTW, we later adapted the cache disk controller to work on CDC and
    Burroughs systems (I think all our Burroughs sales were to Large Systems customers). Then we were bought out by StorageTech, and they weren't interested in that market.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Mar 3 19:01:06 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 2/27/2026 1:52 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 2/24/2026 11:33 PM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    Let me better explain what I was trying to set up, then you can tell me
    where I went wrong. I did expect the records to be sequential, and
    could be pre-fetched, but with the inner loop so short, just a few
    instructions, I thought that it would quickly "get ahead" of the
    prefetch. That is, that there was a small limit on the number of
    prefetches that could be in process simultaneously, and with such a
    small CPU loop, it would quickly hit that limit, and thus be latency bound.

    I think that it's bandwidth-bound, because none of the memory (or
    outer-level cache) accesses depend on the results of previous ones; so
    the loads can be started right away, up to the limit of memory-level
    parallelism of the hardware. If the records are in RAM, the hardware
    prefetcher can help to avoid running into the scheduler and ROB limits
    of the OoO engine.

    I think our difference may be just terminology rather than substance.
    To me, it is precisely the limit you mentioned that makes it latency
    rather than bandwidth limited.

    I mentioned several limits. Which one do you have in mind?

    The one you mentioned in your last paragraph, specifically,
    the limit of memory-level parallelism of the hardware.


    Think of it this way. In the current
    situation, increasing the memory system bandwidth, say by hypothetically
    increasing the number of memory banks, having a wider interface between
    the memory and the core, etc., all traditional methods for increasing
    memory bandwidth, would not improve the performance. On the other hand,
    doing things to reduce the memory latency (say hypothetically a faster
    ram cell), would improve the performance.

    If the CPU is designed to provide enough memory-level parallelism to
    make use of the bandwidth (and that is likely, otherwise why provide
    that much bandwidth), then once the designers spend money on
    increasing the bandwidth, they will also spend the money necessary to
    increase the MLP.

    No. The memory system throughput depends upon the access pattern. It
    is easier/lower cost to increase the throughput for sequential accesses
    than random (think wider interfaces, cache blocks larger than the amount
    accessed, etc.)

    It depends on the number of accesses and the ability to absorb the latency.

    For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.

    A page (4096B) at 5GB/s needs 800ns±
    So, we need 12.5 Banks to saturate the data channel
    And we have 16 busses each with 16 banks,
    we can sustain 80GB/s
    So, we need in excess of 200 outstanding requests
    Each able to absorb slightly more than 10µs.

    But this is more like a disk/flash system than main memory.

    Yes. In particular, requests to main memory are typically the size of a cache block, not a page. That changes the calculations above.


    Once each bank (of which there are 256) has more than a 3-deep queue
    the BW can be delivered as long as the requests have almost ANY order
    other than targeting 1 (or few) bank(s).

    So you are saying that the system is bandwidth limited as long as the
    CPU can sustain 768 (3 * 256) simultaneous prefetches in progress. OK :-)


    But what you say is TRUE when one limits the interpretation to modern
    CPUs, but not when one limits themselves to applications running on
    modern CPUs requesting pages from long term storage. {how do you think
    Data Bases work??}

    OK. BTW, I have done database "stuff" since CODASYL database systems in
    the 1970s through relational systems in the 1980s. But page sized
    accesses to external storage wasn't what we were talking about.



    But optimization for sequential workloads can actually
    hurt performance for random workloads, e.g. larger block sizes reduce
    the number of accesses for sequential workloads, but each access takes
    longer, thus hurting random workloads. So

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk controller, making it the industry's first true cache disk controller.
    This was almost all about reducing latency (from tens of milliseconds on
    a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
    link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
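As a sanity check on those numbers, Little's law (requests in flight = arrival rate × latency) can be applied directly; the constants below are copied from the post, and the script itself is only illustrative:

```python
# Little's law check on the SSD figures quoted above.
# Assumed values (from the post): 3.3M IOPS, 4096 B transfers, 70 us delay.
IOPS = 3.3e6          # random read transfers per second
XFER = 4096           # bytes per transfer
DELAY = 70e-6         # seconds of access delay per request

payload_bw = IOPS * XFER   # raw payload rate, ~13.5 GB/s
in_flight = IOPS * DELAY   # Little's law: L = lambda * W, ~231 requests

print(f"{payload_bw / 1e9:.1f} GB/s payload, ~{in_flight:.0f} requests in flight")
```

The ~230 outstanding requests implied by the 70µs delay suggest why such drives only reach their headline IOPS at deep queue depths.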



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Tue Mar 3 11:35:56 2026
    From Newsgroup: comp.arch

    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
    controller, making it the industry's first true cache disk controller.
    This was almost all about reducing latency (from tens of milliseconds on
    a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
    link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.

    Wow! But I think you will agree that design is unlikely to be used for
"mass market" hundreds-of-terabyte systems used for commercial database systems, etc. (at least for a while). Do you know its cost per
GB? SSDs certainly solve the multi-millisecond access-time problem of
hard disks, but at a high cost. I think that hard disk sales are not
    going away for at least a while. :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Tue Mar 3 20:19:00 2026
    From Newsgroup: comp.arch

    In article <10o2a47$j1pl$[email protected]>, [email protected] (Kent Dickey) wrote:

    Technically, IA64 just had to be "as good as" x86, to make it worth
    while to jump to a new architecture which removes their competitor.
    I can see how even smart folks could get sucked in to thinking
    "architecture doesn't matter, and this new one prevents clones, so
    we should do it to eventually make more money".

    It very notably failed to offer the end-users anything attractive. Since
    those days, I've worked on the basis that you need to do that, and if you
    can manage it consistently, that keeps you competitive.

    Everyone technically minded knew IA64 was technically not that
    great, but both companies had their reasons to do it anyway.

    The problem with floating-point advance loads not always working if there
    was a function call between the load and the check made sure that IA64
    would be inferior on a technical level. There wasn't a way to fix that
    and retain software compatibility.

    Once that was clear, I was always looking to get rid of the architecture.
    We just said "no" to a US Government customer who wanted IA64 Linux. The
    policy I set for sales for other operating systems amounted to the same
    thing:

    "Find out the maximum they're willing to pay. Quote them three times that much."

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Tue Mar 3 21:55:15 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
    controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
    link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.

Wow! But I think you will agree that design is unlikely to be used for "mass market" hundreds of terabyte systems used for commercial database systems, etc.

    I think you'll find that the commercial databases are dominated
    by high-end NVMe (PCI based SSDs) for working storage, with
    spinning rust as archive storage.

For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.

    https://techatlantix.com/mzwmo61thclf-00aw7.html

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Wed Mar 4 09:22:56 2026
    From Newsgroup: comp.arch

Let me try again. Suppose you had a (totally silly) program with a 2 GB array, and you used a random number to generate an address within it, then
    added the value at that addressed byte to an accumulator. Repeat say 10,000 times. I would call this program latency bound, but I suspect Anton would call it bandwidth bound. If that is true, then that explains the original differences Anton and I had.

    I think in theory, this is not latency bound: assuming enough CPU and
    memory parallelism in the implementation, it can be arbitrarily fast.
    But in practice it will probably be significantly slower than if you
    were to do a sequential traversal.
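For concreteness, Stephen's thought experiment can be sketched as below (array size scaled down so the sketch actually runs; at the full 2 GB nearly every access would miss the TLB and every cache level):

```python
# Minimal sketch of the thought experiment: sum n randomly addressed
# bytes out of a large array. All names here are illustrative.
import random

SIZE = 2 * 1024**3        # 2 GiB in the original thought experiment
N = 10_000                # number of random accesses

def random_sum(buf: bytearray, n: int, seed: int = 42) -> int:
    """Add n randomly addressed bytes of buf into an accumulator."""
    rng = random.Random(seed)
    acc = 0
    for _ in range(n):
        acc += buf[rng.randrange(len(buf))]  # likely cache+TLB miss at full size
    return acc

# Demo on a deliberately small zero-filled buffer so the sketch runs anywhere.
print(random_sum(bytearray(1 << 20), N))  # prints 0
```

Each iteration depends only on the random number generator, not on the previous load, so in principle the loads could all be issued in parallel — which is the crux of the latency-vs-bandwidth disagreement.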

    Indeed, in practice you may sometimes see the performance be correlated
    with your memory latency, but if so it's only because your hardware
    doesn't offer enough parallelism (e.g. not enough memory banks).

    AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help speed
    it up (typically because of pointer-chasing). This is important,
    because it's a *lot* more difficult to reduce memory latency than it is
    to add bandwidth or parallelism.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Wed Mar 4 07:19:29 2026
    From Newsgroup: comp.arch

    On 3/4/2026 6:22 AM, Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with a 2 GB
array, and you used a random number to generate an address within it, then
added the value at that addressed byte to an accumulator. Repeat say 10,000
times. I would call this program latency bound, but I suspect Anton would
call it bandwidth bound. If that is true, then that explains the original
differences Anton and I had.

    I think in theory, this is not latency bound: assuming enough CPU and
    memory parallelism in the implementation, it can be arbitrarily fast.
    But in practice it will probably be significantly slower than if you
    were to do a sequential traversal.

    Precisely! Given that array size, essentially all memory accesses will
    be cache misses, i.e. to main DRAM, and the amount of parallelism
    required to make it bandwidth bound is totally impractical for a CPU to provide.


    Indeed, in practice you may sometimes see the performance be correlated
    with your memory latency, but if so it's only because your hardware
    doesn't offer enough parallelism (e.g. not enough memory banks).

    Yes. But providing "enough" parallelism in this case is impractical.


    AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help speed
    it up (typically because of pointer-chasing). This is important,
    because it's a *lot* more difficult to reduce memory latency than it is
    to add bandwidth or parallelism.


    As the old saying goes, "Bandwidth is only money, but latency is forever."
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Wed Mar 4 07:44:39 2026
    From Newsgroup: comp.arch

    On 3/3/2026 1:55 PM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller. This was almost all about reducing latency (from tens of milliseconds on a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
    Wow! But I think you will agree that design is unlikely to be used for
    "mass market" hundreds of terabyte systems used for commercial database
    systems, etc.

    I think you'll find that the commercial databases are dominated
    by high-end NVMe (PCI based SSDs) for working storage, with
    spinning rust as archive storage.

For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.

    https://techatlantix.com/mzwmo61thclf-00aw7.html

    I freely admit that I am "out of the loop" for modern systems. However,
    this system costs about $125/TB. A quick check of Amazon shows typical
hard disk prices at about $25/TB. Are you saying that typical current systems are paying about 5 times the price, and way greater than the
    cost of the CPU for such systems? While obviously there is a market for
    such systems, it is hard for me to believe that the typical "enterprise" customer would be that market. Amazing!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Wed Mar 4 15:57:33 2026
    From Newsgroup: comp.arch

    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 1:55 PM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk controller, making it the industry's first true cache disk controller. This was almost all about reducing latency (from tens of milliseconds on a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

There is an SSD that can perform 3,300,000×4096B random read transfers per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.

    Wow! But I think you will agree that design is unlikely to be used for
    "mass market" hundreds of terabyte systems used for commercial database
    systems, etc.

    I think you'll find that the commercial databases are dominated
    by high-end NVMe (PCI based SSDs) for working storage, with
    spinning rust as archive storage.

For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.

    https://techatlantix.com/mzwmo61thclf-00aw7.html

I freely admit that I am "out of the loop" for modern systems. However, this system costs about $125/TB. A quick check of Amazon shows typical
hard disk prices at about $25/TB. Are you saying that typical current systems are paying about 5 times the price, and way greater than the
cost of the CPU for such systems? While obviously there is a market for such systems, it is hard for me to believe that the typical "enterprise" customer would be that market. Amazing!

    Price vs. performance. The latter rules the enterprise space. SSDs
    require about 33% of the power and generate much less heat.

    Note that the amazon prices are generally for consumer grade
    hardware.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Mar 4 19:03:54 2026
    From Newsgroup: comp.arch

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with a 2 GB
array, and you used a random number to generate an address within it, then
added the value at that addressed byte to an accumulator. Repeat say 10,000
times. I would call this program latency bound, but I suspect Anton would
call it bandwidth bound. If that is true, then that explains the original
differences Anton and I had.

    I think in theory, this is not latency bound: assuming enough CPU and
    memory parallelism in the implementation, it can be arbitrarily fast.
    But in practice it will probably be significantly slower than if you
    were to do a sequential traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses might be
very significant unless you've set up huge pages.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime will
    be horrible!

    Indeed, in practice you may sometimes see the performance be correlated
    with your memory latency, but if so it's only because your hardware
    doesn't offer enough parallelism (e.g. not enough memory banks).

    AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help speed
    it up (typically because of pointer-chasing). This is important,
    because it's a *lot* more difficult to reduce memory latency than it is
    to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a classic
    Cray could perform much better than a modern OoO cpu.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Mar 4 20:06:02 2026
    From Newsgroup: comp.arch

    On Wed, 4 Mar 2026 07:44:39 -0800
    Stephen Fuld <[email protected]d> wrote:
    On 3/3/2026 1:55 PM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's
    disk controller, making it the industry's first true cache disk
    controller. This was almost all about reducing latency (from
    tens of milliseconds on a non cache controller to hundreds of
    microseconds on a cache hit). There was a small improvement in
    transfer rate, but the latency reduction dominated the
    improvement.

There is an SSD that can perform 3,300,000×4096B random read
transfers per second on a PCIe 5.0-×4 connector. That is 13.2GB/s
over the PCIe link which is BW limited to 15.x GB/s. Each "RAS"
has a 70µs access delay.

    Wow! But I think you will agree that design is unlikely to be
    used for "mass market" hundreds of terabyte systems used for
    commercial database systems, etc.

    I think you'll find that the commercial databases are dominated
    by high-end NVMe (PCI based SSDs) for working storage, with
    spinning rust as archive storage.

For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.

    https://techatlantix.com/mzwmo61thclf-00aw7.html

    I freely admit that I am "out of the loop" for modern systems.
    However, this system costs about $125/TB. A quick check of Amazon
    shows typical hard disk prices at about $25 /TB.
That's not a fair comparison.
The $25/TB you see on Amazon is likely a 5400 rpm 4TB disk.
    So, at 61 TB you only have 15 spindles. It means that even in ideal
    conditions of accesses distributed evenly to all disks your sequential
    read bandwidth is no better than ~1,200 MB/s.
    For comparison, sequential read speed of BM1743 SSD is 7,500 MB/s.
In order to get comparable bandwidth with HDs you will need
95 spindles. Maybe a little less with 7200 rpm. I don't know how many
spindles you would need with the 15000 rpm HDs that enterprises used to use
for databases 20+ years ago. It seems they had better latency than
7200 rpm drives, but about the same bandwidth. Anyway, I'm not even sure that
anybody still makes them.
95 disks alone almost certainly cost more than 8KUSD. And on top of
that you will need a dozen expensive RAID controllers.
So, essentially, when you paid 8KUSD for this SSD, you paid for the
bandwidth alone. The 100x improvement in latency that you got over an
HD-based solution is a free bonus. Density and power too.
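The spindle arithmetic above can be reproduced in a few lines (the ~80 MB/s sustained rate per 5400 rpm drive is an assumption implied by the 1,200 MB/s across 15 spindles figure; the 7,500 MB/s SSD rate is the one quoted in the post):

```python
# Check the HDD-vs-SSD bandwidth arithmetic from the post.
HDD_BW = 80e6              # assumed bytes/s sustained per 5400 rpm spindle
SSD_BW = 7.5e9             # BM1743 sequential read, per the post
SPINDLES_61TB = 61 // 4    # 15 x 4 TB drives to reach ~61 TB

array_bw = SPINDLES_61TB * HDD_BW     # ~1.2 GB/s across the whole array
spindles_needed = SSD_BW / HDD_BW     # ~94 drives to match one SSD

print(f"{array_bw / 1e9:.1f} GB/s array, {spindles_needed:.0f} spindles to match SSD")
```

93.75 spindles rounds up to roughly the 95 quoted in the post.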
    Are you saying that
    typical current systems are paying about 5 times the price, and way
    greater than the cost of the CPU for such systems? While obviously
    there is a market for such systems, it is hard for me to believe that
    the typical "enterprise" customer would be that market. Amazing!


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Mar 4 20:25:56 2026
    From Newsgroup: comp.arch

    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat say 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound.
    If that is true, then that explains the original differences Anton
    and I had.

    I think in theory, this is not latency bound: assuming enough CPU
    and memory parallelism in the implementation, it can be arbitrarily
    fast. But in practice it will probably be significantly slower than
    if you were to do a sequential traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've set up huge pages.


    Relatively horrible.
At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
    will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help
    speed it up (typically because of pointer-chasing). This is
    important, because it's a *lot* more difficult to reduce memory
    latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a classic
    Cray could perform much better than a modern OoO cpu.

    Terje


When the working set does not allow any cache re-use, then it does not fit
in a classic Cray's main memory.

Besides, it is nearly impossible to create code that does something
useful and has no cache hits at all. At the very least, there will be
reuse on the instruction side. But I think that in order to completely avoid
reuse on the data side you'd have to do something unrealistic.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Wed Mar 4 19:38:08 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat say 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound.
    If that is true, then that explains the original differences Anton
    and I had.

    I think in theory, this is not latency bound: assuming enough CPU
    and memory parallelism in the implementation, it can be arbitrarily
    fast. But in practice it will probably be significantly slower than
    if you were to do a sequential traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've set up huge pages.


    Relatively horrible.
At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
    will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that adding
    parallelism and/or bandwidth to your memory hierarchy won't help
    speed it up (typically because of pointer-chasing). This is
    important, because it's a *lot* more difficult to reduce memory
    latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a classic
    Cray could perform much better than a modern OoO cpu.

    Terje


    When working set does not allow any cache re-use then it does not fit
    in classic Cray's main memory.

The 1985 Cray-2 allowed 2GiB, so theoretically possible, with the
OS+program fitting into the ~147 MB gap between 2GiB and 2E9 bytes.

    Easily done on a later Cray-Y-MP.


    Besides, it is nearly impossible to create a code that does something
    useful and has no cache hits at all. At very least, there will be
    reuse on instruction side. But I think that in order to completely avoid reuse on the data side you'll have to do something non-realistic.

I think the current "gedankenexperiment" is way beyond "something
    useful". :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Mar 4 18:51:11 2026
    From Newsgroup: comp.arch


    Stephen Fuld <[email protected]d> posted:

    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
    controller, making it the industry's first true cache disk controller.
    This was almost all about reducing latency (from tens of milliseconds on >> a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
    link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.

    Wow! But I think you will agree that design is unlikely to be used for "mass market" hundreds of terabyte systems used for commercial database systems, etc. (at least for a while) Do you know what is its cost per
    GB? SSDs certainly solve the multi millisecond access time of hard
    disks problem, but at a high cost. I think that hard disk sales are not going away for at least a while. :-)

IIRC:: it is 10-ish TB and in the $3,000 range.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Mar 4 21:17:00 2026
    From Newsgroup: comp.arch

    On Wed, 4 Mar 2026 19:38:08 +0100
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat say 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound.
    If that is true, then that explains the original differences
    Anton and I had.

    I think in theory, this is not latency bound: assuming enough CPU
    and memory parallelism in the implementation, it can be
    arbitrarily fast. But in practice it will probably be
    significantly slower than if you were to do a sequential
    traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've set up huge pages.


    Relatively horrible.
At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
    will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that
    adding parallelism and/or bandwidth to your memory hierarchy
    won't help speed it up (typically because of pointer-chasing).
    This is important, because it's a *lot* more difficult to reduce
    memory latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When working set does not allow any cache re-use then it does not
    fit in classic Cray's main memory.

    The 1985 Cray-2 allowed 2GB, so theoretically possible with the
    OS+program into the 73 MB gap between 2GiB and 2E9.

    Easily done on a later Cray-Y-MP.


    I had Cray-1 in mind.

    Cray-2 memory was big enough, but was it fast enough latency wise?
    All info I see about Cray-2 memory praises its great capacity and
    bandwidth, but tells nothing about latency.
It seems the latency was huge and the whole system was useful only due
to a mechanism that today we would call hardware prefetch. Or software
prefetch? I am not sure.
    But it is possible that I misunderstood.


    Besides, it is nearly impossible to create a code that does
    something useful and has no cache hits at all. At very least, there
    will be reuse on instruction side. But I think that in order to
    completely avoid reuse on the data side you'll have to do something non-realistic.

I think the current "gedankenexperiment" is way beyond "something
    useful". :-)


    Agreed.
Starting from the latest post of Stephen Fuld (2026-03-03 03:02) we left
the realm of "useful" for good.

    Terje



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@[email protected] to comp.arch on Wed Mar 4 11:49:28 2026
    From Newsgroup: comp.arch

    On 3/4/2026 11:17 AM, Michael S wrote:
    On Wed, 4 Mar 2026 19:38:08 +0100
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat say 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound.
    If that is true, then that explains the original differences
    Anton and I had.

    I think in theory, this is not latency bound: assuming enough CPU
    and memory parallelism in the implementation, it can be
    arbitrarily fast. But in practice it will probably be
    significantly slower than if you were to do a sequential
    traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've set up huge pages.


    Relatively horrible.
At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
    will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that
    adding parallelism and/or bandwidth to your memory hierarchy
    won't help speed it up (typically because of pointer-chasing).
    This is important, because it's a *lot* more difficult to reduce
    memory latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does not
    fit in a classic Cray's main memory.

    The 1985 Cray-2 allowed 2GB, so theoretically possible with the
    OS+program fitting into the ~147 MB gap between 2GiB and 2E9 bytes.

    Easily done on a later Cray-Y-MP.


    I had the Cray-1 in mind.

    Cray-2 memory was big enough, but was it fast enough latency-wise?
    All the info I see about Cray-2 memory praises its great capacity and
    bandwidth, but says nothing about latency.
    It seems the latency was huge and the whole system was useful only due
    to a mechanism that today we would call hardware prefetch. Or software
    prefetch? I am not sure.
    But it is possible that I misunderstood.

    Well, Cray had that vector thing going for it. :-) And, as Mitch has
    repeatedly pointed out, the memory bandwidth to support it. And
    "reasonable" memory latency for the non-vector operations.


    Besides, it is nearly impossible to create code that does
    something useful and has no cache hits at all. At the very least,
    there will be reuse on the instruction side. But I think that in
    order to completely avoid reuse on the data side you'll have to do
    something unrealistic.

    I think the current "gedankenexperiment" is way beyond "something
    useful". :-)


    Agreed.
    Starting from the latest post of Stephen Fuld (2026-03-03 03:02) we
    left the realm of "useful" for good.


    Of course. I was trying to correct my previous error: I had intended
    to specify a totally memory-latency-dominated program, but messed up
    and failed to do so. The whole thing started with a terminology
    difference between Anton and me about latency versus bandwidth, and
    has spiraled far away from the original issue.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Wed Mar 4 20:15:53 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    On Wed, 4 Mar 2026 07:44:39 -0800
    Stephen Fuld <[email protected]d> wrote:

    On 3/3/2026 1:55 PM, Scott Lurndal wrote:
    Stephen Fuld <[email protected]d> writes:
    On 3/3/2026 11:01 AM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    On 3/1/2026 1:13 PM, MitchAlsup wrote:

    Stephen Fuld <[email protected]d> posted:

    snip

    cpu designers minimize latency at a given BW, while
    Long term store designers maximize BW at acceptable latency.
    Completely different design points.

    Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's
    disk controller, making it the industry's first true cache disk
    controller. This was almost all about reducing latency (from
    tens of milliseconds on a non-cache controller to hundreds of
    microseconds on a cache hit). There was a small improvement in
    transfer rate, but the latency reduction dominated the
    improvement.

    There is an SSD that can perform 3,300,000×4096B random read
    transfers per second on a PCIe 5.0-×4 connector. That is 13.2GB/s
    over the PCIe link which is BW limited to 15.x GB/s. Each "RAS"
    has a 70µs access delay.

    Wow! But I think you will agree that design is unlikely to be
    used for "mass market" hundreds-of-terabyte systems used for
    commercial database systems, etc.

    I think you'll find that the commercial databases are dominated
    by high-end NVMe (PCI based SSDs) for working storage, with
    spinning rust as archive storage.

    For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.

    https://techatlantix.com/mzwmo61thclf-00aw7.html

    I freely admit that I am "out of the loop" for modern systems.
    However, this system costs about $125/TB. A quick check of Amazon
    shows typical hard disk prices at about $25/TB.

    That's not a fare comparison.

    Indeed, it's not even a fair comparison :-)

    $25/TB you see on Amazon is likely a 5400 rpm 4TB disk.
    So, at 61 TB you only have 15 spindles. It means that even in ideal
    conditions of accesses distributed evenly to all disks your sequential
    read bandwidth is no better than ~1,200 MB/s.
    For comparison, the sequential read speed of a BM1743 SSD is 7,500 MB/s.

    In order to get comparable bandwidth with HDs you would need
    95 spindles. Maybe a little fewer with 7200 rpm. I don't know how many
    spindles you would need with the 15000 rpm HDs that enterprises used
    to use for databases 20+ years ago. It seems they had better latency
    than 7200 rpm, but about the same bandwidth. Anyway, I'm not even sure
    that anybody still makes them.

    95 disks alone almost certainly cost more than 8KUSD. And on top of
    that you will need a dozen expensive RAID controllers.

    So, essentially, when you paid 8KUSD for this SSD, you paid for
    bandwidth alone. The 100x improvement in latency that you got over an
    HD-based solution is a free bonus. Density and power too.

    Density and power are the most important criteria in modern
    datacenters. Reliability is also a consideration. While
    the backblaze drive reports show moderately reasonable results
    for most spinning rust, the NVME SSD is far more reliable and
    consumes far less power and rack space than the equivalent
    hard disk would.

    Another advantage of NVMe cards is the availability of
    PCI SR-IOV, which allows the NVMe card to be partitioned
    and made available to multiple independent guests without
    sacrificing security. The downside of host-based NVMe is
    the inability to share bandwidth with multiple hosts. That
    downside is eliminated with external NVMe based RAID
    subsystems connected via 400Gbe or FC.

    https://www.truenas.com/r-series/r60/

    7PB at 60GB/sec.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Wed Mar 4 21:07:00 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:
    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses might
    be very significant unless you've setup huge pages.


    Relatively horrible.
    At a human time scale it would still be very fast.

    When the working set does not allow any cache re-use, then a classic
    Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does not
    fit in a classic Cray's main memory.

    Besides, it is nearly impossible to create code that does something
    useful and has no cache hits at all. At the very least, there will be
    reuse on the instruction side. But I think that in order to completely
    avoid reuse on the data side you'll have to do something unrealistic.

    There is one very reasonable use case: testing a random number
    generator. A useful test is to ensure numbers are uncorrelated, so you
    get 3 random numbers called A, B, C, and you look up A*N*N + B*N + C
    to count the number of times you see A followed by B followed by C,
    where N is the range of the random value, say, 0 - 1023. This would
    be an array of 1 billion 32-bit values. You get 1000 billion random
    numbers, and then look through to make sure most buckets have a value
    around 1000. Any buckets less than 500 or more than 1500 might be
    considered a random number generator failure.
    This is a useful test since it intuitively makes sense--if some
    patterns are too likely (or unlikely), then you know you have a
    problem with your "random" numbers.
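    Kent's histogram test can be sketched as follows. A hedged sketch
    only: the xorshift32 generator below is a placeholder for the PRNG
    under test, and n is kept as a parameter so it can be scaled down
    (at n = 1024 the histogram is the 4 GB, every-access-misses case):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* xorshift32 stands in for the PRNG under test. */
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Count occurrences of each (A,B,C) triple of consecutive outputs
   in [0,n).  hist must hold n*n*n uint32_t counters. */
void triple_histogram(uint32_t *hist, uint32_t n, uint64_t triples,
                      uint32_t seed) {
    memset(hist, 0, (size_t)n * n * n * sizeof *hist);
    for (uint64_t t = 0; t < triples; t++) {
        uint32_t a = xorshift32(&seed) % n;
        uint32_t b = xorshift32(&seed) % n;
        uint32_t c = xorshift32(&seed) % n;
        hist[((size_t)a * n + b) * n + c]++;   /* A*N*N + B*N + C */
    }
}
```

    With T triples and n^3 buckets the expected count per bucket is
    T/n^3; per Kent's criterion, buckets below half or above 1.5x that
    value flag a suspect generator.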

    Another use case would be an algorithm which wants to shuffle a large
    array (say, you want to create test cases for a sorting algorithm). I
    think most shuffling algorithms which are fair will randomly index into
    the array, and each of these will be a cache miss.
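    The fair shuffle alluded to above is, canonically, Fisher-Yates; a
    minimal C sketch (again with a placeholder xorshift64 generator)
    shows why each step is a random index into the remaining prefix:

```c
#include <stdint.h>
#include <stddef.h>

static uint64_t xs64(uint64_t *s) {
    uint64_t x = *s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

/* Fisher-Yates: for i from n-1 down to 1, swap a[i] with a random
   a[j], j in 0..i.  On a huge array each a[j] access is effectively
   a cache miss.  (The modulo has slight bias; ignored here.) */
void shuffle(int *a, size_t n, uint64_t seed) {
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xs64(&seed) % (i + 1));
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}
```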

    Kent
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Wed Mar 4 23:35:11 2026
    From Newsgroup: comp.arch

    On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
    [email protected] (Kent Dickey) wrote:

    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:
    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses
    might be very significant unless you've setup huge pages.


    Relatively horrible.
    At a human time scale it would still be very fast.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does not
    fit in a classic Cray's main memory.

    Besides, it is nearly impossible to create code that does something
    useful and has no cache hits at all. At the very least, there will be
    reuse on the instruction side. But I think that in order to completely
    avoid reuse on the data side you'll have to do something
    unrealistic.

    There is one very reasonable use case: testing a random number
    generator. A useful test is to ensure numbers are uncorrelated, so
    you get 3 random numbers called A, B, C, and you look up A*N*N + B*N
    + C to count the number of times you see A followed by B followed by
    C, where N is the range of the random value, say, 0 - 1023. This
    would be an array of 1 billion 32-bit values. You get 1000 billion
    random numbers, and then look through to make sure most buckets have
    a value around 1000. Any buckets less than 500 or more than 1500
    might be considered a random number generator failure. This is a
    useful test since it intuitively makes sense--if some patterns are
    too likely (or unlikely), then you know you have a problem with your
    "random" numbers.


    Even if there are no cache hits in accesses to the main histogram,
    there are still cache hits in the PRNG that you are testing. Unless it
    is a very simple PRNG completely implemented in registers.
    And even in the case of a very simple PRNG, standard PRNG APIs keep
    state in memory, so in order to avoid memory accesses (i.e. cache
    hits) one would have to use a non-standard API.

    Besides, there are other reasons why a modern big OoO will run rings
    around a Cray, either 1 or 2, in that sort of test. The most
    important are:
    1. Compilers are not perfect.
    2. Even if compilers were perfect, the Cray has far fewer physical
    registers than a Big OoO, which would inevitably lead to far fewer
    memory accesses running simultaneously.

    Another use case would be an algorithm which wants to shuffle a large
    array (say, you want to create test cases for a sorting algorithm). I
    think most shuffling algorithms which are fair will randomly index
    into the array, and each of these will be a cache miss.

    Kent

    I think that many if not all of the same arguments apply here too.
    But I didn't think deeply about it. Besides, there is more than one
    algorithm for random shuffle and their performance characteristics
    likely differ.







    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Mar 4 23:46:46 2026
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
    [email protected] (Kent Dickey) wrote:

    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:
    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses
    might be very significant unless you've setup huge pages.


    Relatively horrible.
    At a human time scale it would still be very fast.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does not
    fit in a classic Cray's main memory.

    Besides, it is nearly impossible to create code that does something
    useful and has no cache hits at all. At the very least, there will be
    reuse on the instruction side. But I think that in order to completely
    avoid reuse on the data side you'll have to do something
    unrealistic.

    There is one very reasonable use case: testing a random number
    generator. A useful test is to ensure numbers are uncorrelated, so
    you get 3 random numbers called A, B, C, and you look up A*N*N + B*N
    + C to count the number of times you see A followed by B followed by
    C, where N is the range of the random value, say, 0 - 1023. This
    would be an array of 1 billion 32-bit values. You get 1000 billion
    random numbers, and then look through to make sure most buckets have
    a value around 1000. Any buckets less than 500 or more than 1500
    might be considered a random number generator failure. This is a
    useful test since it intuitively makes sense--if some patterns are
    too likely (or unlikely), then you know you have a problem with your "random" numbers.


    Even if there are no cache hits in access of main histogram, there are
    still cache hits in PRNG that you are testing. Unless that is very
    simple PRNG completely implemented in registers.
    And even in case of very simple PRNG, standard PRNG APIs keep state in memory, so in order to avoid memory accesses=cache hits one would have
    to use non-standard API.

    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".
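    What is described above is essentially the Bays-Durham shuffle, the
    scheme behind shuffle_order_engine (knuth_b) in C++. A minimal
    sketch, where the base generator and the table size are my
    illustrative assumptions:

```c
#include <stdint.h>

#define BD_TABLE 32   /* the table "does not have to be that big" */

typedef struct {
    uint64_t state;            /* base generator state */
    uint64_t table[BD_TABLE];  /* previously computed RNs */
    uint64_t last;             /* selects the next slot */
} bd_rng;

/* Placeholder base generator: xorshift64. */
static uint64_t bd_base(uint64_t *s) {
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

void bd_init(bd_rng *r, uint64_t seed) {
    r->state = seed ? seed : 1;  /* xorshift state must be nonzero */
    for (int i = 0; i < BD_TABLE; i++)
        r->table[i] = bd_base(&r->state);
    r->last = bd_base(&r->state);
}

/* Bays-Durham step: the previous output picks a table slot; emit
   that slot and refill it with a freshly generated number. */
uint64_t bd_next(bd_rng *r) {
    uint64_t j = r->last % BD_TABLE;
    r->last = r->table[j];              /* emit a previously made RN */
    r->table[j] = bd_base(&r->state);   /* refill with a fresh one */
    return r->last;
}
```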

    Besides, there are other reasons why a modern big OoO will run rings
    around a Cray, either 1 or 2, in that sort of test. The most
    important are:
    1. Compilers are not perfect.
    2. Even if compilers were perfect, the Cray has far fewer physical
    registers than a Big OoO, which would inevitably lead to far fewer
    memory accesses running simultaneously.

    CRAY-Y/MP could run 2 LDs and 1 ST where the LDs were of gather type
    and the STs of scatter type. So, 3 instructions would create 192
    memory references. And all 192 references could be 'satisfied' in 64
    clocks.
    I know of no GBOoO machine with 192 outstanding memory references, and
    fewer still that can satisfy all 192 MRs in 64 clocks.

    The problem was the <ahem> anemic clock cycle of ~6ns. Modern CPUs are
    30× faster, and "just as wide".
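    In scalar C terms the three vector memory instructions amount to the
    loops below. A sketch only: the intervening vector add is my
    addition, and the 64-element vector length is the Y-MP's:

```c
#include <stddef.h>

#define VL 64   /* Cray vector length: 64 elements per vector register */

/* Scalar model of two gather loads, a vector op, and one scatter
   store: the three memory instructions alone generate 3*VL = 192
   independently addressed references. */
void gather_add_scatter(const double *a_mem, const double *b_mem,
                        double *c_mem,
                        const size_t ia[VL], const size_t ib[VL],
                        const size_t ic[VL]) {
    double va[VL], vb[VL], vc[VL];   /* stand-ins for vector registers */
    for (int i = 0; i < VL; i++) va[i] = a_mem[ia[i]];   /* gather LD */
    for (int i = 0; i < VL; i++) vb[i] = b_mem[ib[i]];   /* gather LD */
    for (int i = 0; i < VL; i++) vc[i] = va[i] + vb[i];  /* vector add */
    for (int i = 0; i < VL; i++) c_mem[ic[i]] = vc[i];   /* scatter ST */
}
```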
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From antispam@[email protected] (Waldek Hebisch) to comp.arch on Thu Mar 5 02:54:49 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program with
    a 2 GB array, and you used a random number to generate an address
    within it, then added the value at that addressed byte to an
    accumulator. Repeat say 10,000 times. I would call this program
    latency bound, but I suspect Anton would call it bandwidth bound.
    If that is true, then that explains the original differences Anton
    and I had.

    I think in theory, this is not latency bound: assuming enough CPU
    and memory parallelism in the implementation, it can be arbitrarily
    fast. But in practice it will probably be significantly slower than
    if you were to do a sequential traversal.

    10K selected from 2G means average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses might
    be very significant unless you've setup huge pages.


    Relatively horrible.
    At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
    will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that adding
    parallelism and/or bandwidth to your memory hierarchy won't help
    speed it up (typically because of pointer-chasing). This is
    important, because it's a *lot* more difficult to reduce memory
    latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a classic
    Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does not
    fit in a classic Cray's main memory.

    Besides, it is nearly impossible to create code that does something
    useful and has no cache hits at all. At the very least, there will be
    reuse on the instruction side. But I think that in order to completely
    avoid reuse on the data side you'll have to do something unrealistic.

    I remember a report from the nineties. IIRC NASA folks were comparing
    a Cray with a Pentium-based machine. The Pentium offered many more
    flops, but on their benchmarks the Cray was faster. They wrote that
    they had long vectors which did not fit in any cache, and the Cray
    memory subsystem had better bandwidth. Caches are now bigger, but
    data is bigger too.

    Note that people usually work hard to better utilize caches. I do
    not know if in the case above there was no way to rewrite the code in
    a cache-friendly way or it was simply considered to be too much work.
    Also, I do not know if the lack of reuse was "complete", or simply
    that memory bandwidth was the bottleneck.
    --
    Waldek Hebisch
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Mar 5 12:07:45 2026
    From Newsgroup: comp.arch

    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:
    Michael S <[email protected]> posted:

    On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
    [email protected] (Kent Dickey) wrote:

    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:
    10K selected from 2G means average distance of 200K, so you
    get effectively very close to zero cache hits, and even TLB
    misses might be very significant unless you've setup huge
    pages.

    Relatively horrible.
    At a human time scale it would still be very fast.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use then it does
    not fit in a classic Cray's main memory.

    Besides, it is nearly impossible to create a code that does
    something useful and has no cache hits at all. At very least,
    there will be reuse on instruction side. But I think that in
    order to completely avoid reuse on the data side you'll have to
    do something non-realistic.

    There is one very reasonable use case: testing a random number
    generator. A useful test is to ensure numbers are uncorrelated, so
    you get 3 random numbers called A, B, C, and you look up A*N*N +
    B*N
    + C to count the number of times you see A followed by B followed
    by C, where N is the range of the random value, say, 0 - 1023.
    This would be an array of 1 billion 32-bit values. You get 1000
    billion random numbers, and then look through to make sure most
    buckets have a value around 1000. Any buckets less than 500 or
    more than 1500 might be considered a random number generator
    failure. This is a useful test since it intuitively makes
    sense--if some patterns are too likely (or unlikely), then you
    know you have a problem with your "random" numbers.


    Even if there are no cache hits in access of main histogram, there
    are still cache hits in PRNG that you are testing. Unless that is
    very simple PRNG completely implemented in registers.
    And even in case of very simple PRNG, standard PRNG APIs keep state
    in memory, so in order to avoid memory accesses=cache hits one
    would have to use non-standard API.

    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated one.
    The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".

    Besides, there are other reasons why a modern big OoO will run rings
    around a Cray, either 1 or 2, in that sort of test. The most
    important are:
    1. Compilers are not perfect.
    2. Even if compilers were perfect, the Cray has far fewer physical
    registers than a Big OoO, which would inevitably lead to far fewer
    memory accesses running simultaneously.

    CRAY-Y/MP could run 2 LDs and 1 ST where the LDs were of gather type
    and the STs of scatter type. So, 3 instructions would create 192
    memory references. And all 192 references could be 'satisfied' in 64
    clocks.
    But how many clocks would it take to build a gather-scatter list that
    is reused only twice?
    I know of no GBOoO machine with 192 outstanding memory references, and
    fewer still that can satisfy all 192 MRs in 64 clocks.

    The problem was the <ahem> anemic clock cycle of ~6ns. Modern CPUs are
    30× faster, and "just as wide".
    I think that even if its clock rate were 5 GHz, the Cray-Y/MP would
    still be beaten by [1 core/thread of] a Big OoO because of the
    above-mentioned bottleneck. Of course, here I assume that the gather
    still has a latency of 400ns, which would be 2000 clocks on our
    imaginary machine.
    If the latency of the gather were cut to 200 ns then I am no longer
    sure of the outcome. At 100ns the Cray likely wins vs. 1 Big OoO
    core/thread.
    In practice, if quickness is of major importance, the task could be
    partially parallelized to take advantage of the additional cores
    present in modern gear. But doing so requires a programmer's mind-set.
    Math people who do this type of test rarely possess one.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Mar 5 12:22:00 2026
    From Newsgroup: comp.arch

    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:


    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated one.
    The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".


    I know how to build an extremely good (but not crypto quality) and
    reasonably fast PRNG completely in CPU registers (hint: all modern
    cores can do a round of AES as a single reg-reg instruction). But
    such a PRNG cannot have a standard API.
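    A portable sketch of the register-resident idea: the AES-NI route
    would use the x86 _mm_aesenc_si128 intrinsic, but here the
    splitmix64 finalizer stands in for the AES round, so this
    illustrates the state-in-registers API shape, not AES quality:

```c
#include <stdint.h>

/* A counter-mode PRNG whose entire state is two 64-bit words, so with
   inlining it can live in registers for a whole benchmark loop.  The
   splitmix64 finalizer below is a portable stand-in for a single
   reg-reg AES round. */
typedef struct { uint64_t ctr, key; } reg_rng;

static uint64_t mix64(uint64_t x) {      /* splitmix64 finalizer */
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

reg_rng reg_rng_init(uint64_t seed) {
    reg_rng r;
    r.ctr = 0;
    r.key = mix64(seed ^ 0x9e3779b97f4a7c15ULL);
    return r;
}

/* Non-standard API by construction: no hidden global state, nothing
   forced through memory. */
uint64_t reg_rng_next(reg_rng *r) {
    return mix64(r->ctr++ ^ r->key);
}
```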

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Torbjorn Lindgren@[email protected] to comp.arch on Thu Mar 5 12:57:28 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> wrote:
    Stephen Fuld <[email protected]d> posted:
    Perhaps that is true now, but it certainly didn't use to be. In
    1979-1980 I wrote the microcode to add caching to my employer's disk
    controller, making it the industry's first true cache disk controller.
    This was almost all about reducing latency (from tens of milliseconds on
    a non cache controller to hundreds of microseconds on a cache hit).
    There was a small improvement in transfer rate, but the latency
    reduction dominated the improvement.

    There is an SSD that can perform 3,300,000×4096B random read transfers
    per second on a PCIe 5.0-×4 connector.

    Samsung PM1753 or Micron 9550? Both advertise 3.3M random 4K reads/s.
    Kioxia CM9 advertises 3.4M random 4K reads, so even closer. It might
    just have been tested on a slightly different platform that could get
    closer to the maximum.

    There's a bunch of "value" products in the 2M-2.8M IOPS range, both
    older models from the high end manufacturers above and products from
    more value- or "max capacity"-oriented brands like Kingston and
    Solidigm.

    And... PCIe 6.0 x4 drives are now here: the Micron 9650 PCIe 6.0 x4
    advertises 5.5M random 4K reads/s. I'm guessing this is a respin of
    the 9550 controller with a faster PCIe interface, a clock speed bump,
    and somewhat faster flash. Much faster to market.

    It's not filling the interface but that's AFAIK normal for the first
    generation on a faster interface which is then followed by the full
    redesign in a better process node (higher speed/lower power).

    So I expect we'll see SSDs that can fill the PCIe 6.0 x4 interface
    (6.6-6.8M 4K random read IOPS) in probably 6-12 months - and then the
    "respin for a quick bump, then redesign for max performance" cycle
    will likely repeat when PCIe 7.0 becomes generally available in high
    end devices. The final 7.0 spec was released to manufacturers June
    11, 2025, so perhaps that cycle will start somewhere in 2027?
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Mar 5 11:07:40 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    Michael S <[email protected]> posted:

    Besides, there are other reasons why a modern big OoO will run rings
    around a Cray, either 1 or 2, in that sort of test. The most
    important are:
    1. Compilers are not perfect.
    2. Even if compilers were perfect, the Cray has far fewer physical
    registers than a Big OoO, which would inevitably lead to far fewer
    memory accesses running simultaneously.

    CRAY-Y/MP could run 2 LDs and 1 ST where the LDs were of gather type
    and the STs of scatter type. So, 3 instructions would create 192
    memory references. And all 192 references could be 'satisfied' in 64
    clocks.
    I know of no GBOoO machine with 192 outstanding memory references, and
    fewer still that can satisfy all 192 MRs in 64 clocks.

    There are quite a few recent papers exploring having huge numbers of
    Miss Status Holding Registers (MSHRs) in FPGAs using BRAMs,
    allowing numbers like 2048 concurrent misses, e.g.:

    Stop crying over your cache miss rate:
    Handling efficiently thousands of outstanding misses in FPGAs, 2019
    https://dl.acm.org/doi/pdf/10.1145/3289602.3293901

    Also some papers exploring optimizing scatter-gathers. e.g.:

    Piccolo: Large-scale graph processing with
    fine-grained in-memory scatter-gather, 2025
    https://arxiv.org/abs/2503.05116


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Thu Mar 5 08:36:24 2026
    From Newsgroup: comp.arch

    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.

    Are you aware of any example code on Github? I'd be interested in
    the implementation details of a decent realization of this technique.

    Thank you,

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Thu Mar 5 12:02:13 2026
    From Newsgroup: comp.arch

    Andy Valencia [2026-03-05 08:36:24] wrote:
    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.
    Are you aware of any example code on Github?

    What does this have to do with Github?


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@[email protected] (Scott Lurndal) to comp.arch on Thu Mar 5 17:14:50 2026
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    Andy Valencia [2026-03-05 08:36:24] wrote:
    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.
    Are you aware of any example code on Github?

    What does this have to do with Github?

    Andy is looking for examples. The most likely place to find them
    this year, would be github (rather than sourceforge et alia).
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Thu Mar 5 19:41:04 2026
    From Newsgroup: comp.arch

    On Thu, 05 Mar 2026 08:36:24 -0800
    Andy Valencia <[email protected]> wrote:

    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated
    one. The table goes a long way in 'whitening' the RNG.

    Are you aware of any example code on Github? I'd be interested in
    the implementation details of a decent realization of this technique.

    Thank you,

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html

    Look at the knuth_b source code in your C++ standard library
    installation.
    BTW, I don't share Mitch's enthusiasm for this technique.
    It is neither very fast nor does it have exceptional randomness
    properties, whatever that means.
    It seems that in later years Donald Knuth himself recognized that
    it's nothing special.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Mar 5 17:49:35 2026
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:

    Michael S <[email protected]> posted:

    On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
    [email protected] (Kent Dickey) wrote:

    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:
    10K selected from 2G means average distance of 200K, so you
    get effectively very close to zero cache hits, and even TLB
    misses might be very significant unless you've setup huge
    pages.

    Relatively horrible.
    At a human time scale it would still be very fast.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


When the working set does not allow any cache re-use, then it does
not fit in a classic Cray's main memory.

Besides, it is nearly impossible to create code that does
something useful and has no cache hits at all. At the very least,
there will be reuse on the instruction side. But I think that in
order to completely avoid reuse on the data side you'd have to
do something non-realistic.

There is one very reasonable use case: testing a random number
generator. A useful test is to ensure numbers are uncorrelated, so
you get 3 random numbers called A, B, C, and you look up A*N*N +
B*N + C to count the number of times you see A followed by B followed
by C, where N is the range of the random value, say, 0 - 1023.
This would be an array of 1 billion 32-bit values. You get 1000
billion random numbers, and then look through to make sure most
buckets have a value around 1000. Any buckets less than 500 or
more than 1500 might be considered a random number generator
failure. This is a useful test since it intuitively makes
sense--if some patterns are too likely (or unlikely), then you
know you have a problem with your "random" numbers.
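The bucket test above can be sketched in C. This is a scaled-down illustration, not Kent's actual code: N is 16 rather than 1024 so the histogram fits in a demo, and a small xorshift step stands in for whatever generator is under test. The indexing and the expected count per bucket work exactly as described.

```c
#include <stdint.h>
#include <stdlib.h>

#define N 16                          /* range of each value: 0..N-1 (Kent uses 1024) */
#define BUCKETS ((uint64_t)N * N * N)

/* stand-in generator under test; any PRNG with a next() works here */
static uint32_t rng_state = 0x12345678u;
static uint32_t prng(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Draw 'samples' triples (A,B,C) and histogram them at A*N*N + B*N + C.
   A good generator leaves every bucket near samples/N^3.
   Caller frees the returned array. */
uint32_t *triple_histogram(uint64_t samples)
{
    uint32_t *count = calloc(BUCKETS, sizeof *count);
    for (uint64_t i = 0; i < samples; i++) {
        uint32_t a = prng() % N, b = prng() % N, c = prng() % N;
        count[a * N * N + b * N + c]++;
    }
    return count;
}
```

With Kent's full parameters (N = 1024, 10^12 samples) count[] is a ~4 GB array and nearly every increment misses the cache, which is exactly the access pattern the thread is arguing about.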


Even if there are no cache hits in accesses of the main histogram,
there are still cache hits in the PRNG that you are testing. Unless
it is a very simple PRNG completely implemented in registers.
And even in the case of a very simple PRNG, standard PRNG APIs keep
state in memory, so in order to avoid memory accesses (= cache hits)
one would have to use a non-standard API.

    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated one.
    The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".

Besides, there are other reasons why a modern big OoO will run rings
around a Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
registers than a big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.

The CRAY Y-MP could run 2 LDs and 1 ST, where the LDs were of
gather type and the ST of scatter type. So, 3 instructions would
create 192 memory references. And all 192 references could be
'satisfied' in 64 clocks.

    But how many clocks would it take to build gather-scatter list that is
    reused only twice ?

In sparse matrix "stuff" the gather/scatter list is used lots of times, amortizing the build time.

    I know of no GBOoO machine with 192 outstanding memory references, and fewer still that can satisfy all 192 MRs in 64 clocks.


The problem was the <ahem> anemic clock: a cycle time of ~6 ns. Modern
CPUs are 30× faster, and "just as wide".

I think that even if its clock rate was 5 GHz, the Cray Y-MP would
still be beaten by [1 core/thread of] a big OoO because of the
above-mentioned bottleneck. Of course, here I assume that gather still
has a latency of 400 ns, which would be 2000 clocks on our imaginary
machine.

LDs had a latency of ~20 clocks (early CRAYs) to mid-30s clocks (Y-MP),
delayed only by bank conflicts. Or 80-120 ns.

    If latency of gather cut to 200 ns then I am no longer sure of the
    outcome. At 100ns Cray likely wins vs. 1 Big OoO core/thread.
    In practice, if the quickness is of major importance, the task could be partially parallelized to take advantage of additional cores present in modern gear.

    By the time of the Y-MP, there were up to 8 CPUs, each of which could
    "emit" 3 memory refs per cycle into 256-banked memory with a "coordination" register set for fast hand-off of new blocks of work.

But doing so requires a programmer's mind. Math people
that do this type of test rarely possess one.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Thu Mar 5 18:10:11 2026
    From Newsgroup: comp.arch


    Andy Valencia <[email protected]> posted:

    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.

    Are you aware of any example code on Github? I'd be interested in
    the implementation details of a decent realization of this technique.

    Roughly:

# define shift (8)
# define mask ((1u << shift) - 1)
static unsigned tblindx = 0;
static unsigned table[ 1 << shift ];   /* pre-fill from PRNG() before first use */
extern unsigned PRNG(void);

unsigned WhiteRNG(void)
{
    unsigned index, RNG;
    if( tblindx == 0 ) tblindx = PRNG();
    index = tblindx & mask;
    tblindx >>= shift;
    RNG = table[ index ];
    table[ index ] = PRNG();
    return RNG;
}

    will whiten any reasonable PRNG. There are a variety of ways to index
    the table that make little difference in the outcome. 20 years ago I
    knew the name for this, but I could not find it on my shelf.


    Thank you,

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Thu Mar 5 14:18:51 2026
    From Newsgroup: comp.arch

    Scott Lurndal [2026-03-05 17:14:50] wrote:
    Stefan Monnier <[email protected]> writes:
    Andy Valencia [2026-03-05 08:36:24] wrote:
    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However, >>>> a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.
    Are you aware of any example code on Github?
    What does this have to do with Github?
    Andy is looking for examples. The most likely place to find them
    this year, would be github (rather than sourceforge et alia).

    That doesn't explain why the question is for examples specifically on Github.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Thu Mar 5 14:47:23 2026
    From Newsgroup: comp.arch

    EricP wrote:
    MitchAlsup wrote:
    Michael S <[email protected]> posted:

Besides, there are other reasons why a modern big OoO will run rings
around a Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
registers than a big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.

    CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
    STs of the scatter type. So, 3 instructions would create 192 memory
    references. And all 192 references could be 'satisfied' in 64-clocks.
    I know of no GBOoO machine with 192 outstanding memory references, and
    fewer still that can satisfy all 192 MRs in 64 clocks.

There are quite a few recent papers exploring having huge numbers of
Miss Status Holding Registers (MSHRs) in FPGAs using BRAMs,
allowing numbers like 2048 concurrent misses. e.g.:

Stop crying over your cache miss rate:
Handling efficiently thousands of outstanding misses in FPGAs, 2019
https://dl.acm.org/doi/pdf/10.1145/3289602.3293901

    Also some papers exploring optimizing scatter-gathers. e.g.:

    Piccolo: Large-scale graph processing with
    fine-grained in-memory scatter-gather, 2025
    https://arxiv.org/abs/2503.05116

    Thinking about optimizing scatter-gathers...

    The current design is optimized for moving 64B+ECC cache lines.
    The DIMM has 9 DDR DRAM chips which are read at the same row
    and columns in parallel, with 4 clocks to read a 64/72 ECC word,
    and 8 times that = 32 clocks for a 64B-ECC line.
    It moves to the cache (more clocks) and moves up through the cache
    hierarchy to the core where we finally extract/insert 1 data item.

    But for scatter-gather (SG) we typically only need one data item
    of 4 or 8 bytes out of that 64B line so most of that was waste.
    Furthermore the DRAMs on a DIMM are all being used serially
    at the same time, with no concurrency.

    It could be possible to send an SG read or write packet containing up to
    64 64-bit physical addresses + data for writes to the memory controller,
    and have it perform the whole scatter gather optimally.

    Internally each DRAM chip is composed of some number of banks,
    each bank with some number of subarrays, each subarray with a set of independent sense amps and latches, of rows and columns of bits.

    It could be possible to open a separate row on each subarray in each bank.
    It could be possible to read/write multiple subarrays at the same time
    with each subarray IO routed to a separate pin.

    The SG memory controller would be redesigned to read and write
    multiple 72 bit data items concurrently by:
    - instead of storing individual bits of 9-bit bytes (data+parity) spread
    across separate DRAM chips on a DIMM, store those 9bytes as bits in
    the same row of a single subarray of a single DRAM chip.
    - change the DRAM row size to be a multiple of 9-bit bytes
    - a 72-bit word would be split as two 36-bit half words across pairs of
    subarrays both read at once, which using DDR would be 18 clocks.
    (Compared to the 32 clocks for reading whole 64B cache lines.)
    - multiple subarray pairs can be opened and read/written concurrently
- each DIMM has 8 DRAM chips, and each chip can have all subarray pairs
open at different rows, and can read/write multiple 72-bit words
to different subarray pairs at once.

    If each DIMM has 8 DRAM chips, and each chip had 8 subarray IO pins,
    this could perform a 64-way scatter-gather of 72-bit words in as little
    as 18 clocks with data words in separate subarrays. Or a worst case where
    all the physical addresses land in the same subarray, 64*18 clocks.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Fri Mar 6 11:34:11 2026
    From Newsgroup: comp.arch

    Stefan Monnier <[email protected]> writes:
    ...
    What does this have to do with Github?

    It's the industry standard way to share code?

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From kegs@[email protected] (Kent Dickey) to comp.arch on Fri Mar 6 19:52:07 2026
    From Newsgroup: comp.arch

    In article <[email protected]>,
    MitchAlsup <[email protected]d> wrote:

    Andy Valencia <[email protected]> posted:

    MitchAlsup <[email protected]d> writes:
    There are simple PRNGs that create very 'white' RNG sequences. However,
    a generated RN is used to index a table of previously computed RNs,
    and then swap the accessed one with the generated one. The table goes
    a long way in 'whitening' the RNG.

    Are you aware of any example code on Github? I'd be interested in
    the implementation details of a decent realization of this technique.

    Roughly:

# define shift (8)
# define mask ((1u << shift) - 1)
static unsigned tblindx = 0;
static unsigned table[ 1 << shift ];   /* pre-fill from PRNG() before first use */
extern unsigned PRNG(void);

unsigned WhiteRNG(void)
{
    unsigned index, RNG;
    if( tblindx == 0 ) tblindx = PRNG();
    index = tblindx & mask;
    tblindx >>= shift;
    RNG = table[ index ];
    table[ index ] = PRNG();
    return RNG;
}

    will whiten any reasonable PRNG. There are a variety of ways to index
    the table that make little difference in the outcome. 20 years ago I
    knew the name for this, but I could not find it on my shelf.


    Thank you,

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html

    This is a modification of Knuth Algorithm M in section 3.2.2 of The Art
    of Computer Programming, Vol. 2, Seminumerical Algorithms, where there
    is a detailed discussion on it.

    It does help to make simple LFSR PRNGs much better, but XORSHIFT128P (or
    others like it) is better for 64-bit CPUs (see https://en.wikipedia.org/wiki/Xorshift).
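The generator Kent points to is small enough to quote. This is a sketch of the xorshift128+ step using one published set of shift constants (23, 17, 26); the two 64-bit state words fit comfortably in registers.

```c
#include <stdint.h>

/* xorshift128+ state: two 64-bit words, not both zero */
static uint64_t s[2] = { 0x853c49e6748fea9bULL, 0xda3e39cb94b95bdbULL };

uint64_t xorshift128plus(void)
{
    uint64_t       x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23;                            /* shift constants from one */
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);    /* published variant        */
    return s[1] + y;                         /* the '+' is the output mix */
}
```

Note the whole state update is register arithmetic; only the `s[]` object itself touches memory, which ties back to Michael S's point about standard APIs keeping state in a memory object.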

    Kent
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Fri Mar 6 20:08:34 2026
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:

    EricP wrote:
    MitchAlsup wrote:
    Michael S <[email protected]> posted:

Besides, there are other reasons why a modern big OoO will run rings
around a Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
registers than a big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.

    CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
    STs of the scatter type. So, 3 instructions would create 192 memory
    references. And all 192 references could be 'satisfied' in 64-clocks.
    I know of no GBOoO machine with 192 outstanding memory references, and
    fewer still that can satisfy all 192 MRs in 64 clocks.

There are quite a few recent papers exploring having huge numbers of
Miss Status Holding Registers (MSHRs) in FPGAs using BRAMs,
allowing numbers like 2048 concurrent misses. e.g.:

Stop crying over your cache miss rate:
Handling efficiently thousands of outstanding misses in FPGAs, 2019
https://dl.acm.org/doi/pdf/10.1145/3289602.3293901

    Also some papers exploring optimizing scatter-gathers. e.g.:

Piccolo: Large-scale graph processing with
fine-grained in-memory scatter-gather, 2025
https://arxiv.org/abs/2503.05116

    Thinking about optimizing scatter-gathers...

    The current design is optimized for moving 64B+ECC cache lines.
    The DIMM has 9 DDR DRAM chips which are read at the same row
    and columns in parallel, with 4 clocks to read a 64/72 ECC word,

    64/72 bits pop out every ½ cycle after a 20-30ns delay (CAS)

    and 8 times that = 32 clocks for a 64B-ECC line.

    So, the whole cache line is transferred in 4 clocks (8 doublewords).

    It moves to the cache (more clocks) and moves up through the cache
    hierarchy to the core where we finally extract/insert 1 data item.

    But for scatter-gather (SG) we typically only need one data item
    of 4 or 8 bytes out of that 64B line so most of that was waste.

    Scatter/gather typically reads a dense strided array of pointers/indexes
    and then uses all of these. Each pointer/index accesses <typically> 1 word/doubleword. So, the lists are dense, but the data is sparse.

    Furthermore the DRAMs on a DIMM are all being used serially
    at the same time, with no concurrency.

    Half of the data gets good hit rates, the other half gets poor hit rates.

    It could be possible to send an SG read or write packet containing up to
    64 64-bit physical addresses + data for writes to the memory controller,
    and have it perform the whole scatter gather optimally.

    This would require a Load-Indirect-Vector instruction, rather than a
    LD of the pointer/index followed by a LD/ST of the data itself;
    which is useful on many many more codes than scatter/gather.
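In scalar form the pattern Mitch describes — dense index list, sparse data — is just the following loop (the helper name `gather` is mine, for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form of a gather: idx[] is read densely and streams through
   the cache; src[idx[i]] is the sparse side that mostly misses.
   A Load-Indirect-Vector instruction would fuse the index load and
   the dependent data load into one operation. */
void gather(double *dst, const double *src,
            const uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

Each iteration is a load (dense, cache-friendly) feeding the address of a second load (sparse), which is why the hardware sees two very different access streams.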

    Internally each DRAM chip is composed of some number of banks,

    4-16

    each bank with some number of subarrays, each subarray with a set of independent sense amps and latches, of rows and columns of bits.

    Not how they are organized. A DRAM bank is indexed by RAS, and RAS
    reads out and latches bits in sense amps. CAS accesses bits from those
    sense amps. As long as the sense amp is active, data is being refreshed
in the cells. A single DRAM chip might have 4096-32768 bits latched in
its sense amps. But it only has a few sets of these sense amps and calls
them Banks...

    It could be possible to open a separate row on each subarray in each bank.
    It could be possible to read/write multiple subarrays at the same time
    with each subarray IO routed to a separate pin.

    The SG memory controller would be redesigned to read and write
    multiple 72 bit data items concurrently by:
    - instead of storing individual bits of 9-bit bytes (data+parity) spread
    across separate DRAM chips on a DIMM, store those 9bytes as bits in
    the same row of a single subarray of a single DRAM chip.
    - change the DRAM row size to be a multiple of 9-bit bytes
    - a 72-bit word would be split as two 36-bit half words across pairs of
    subarrays both read at once, which using DDR would be 18 clocks.
    (Compared to the 32 clocks for reading whole 64B cache lines.)
    - multiple subarray pairs can be opened and read/written concurrently
- each DIMM has 8 DRAM chips, and each chip can have all subarray pairs
open at different rows, and can read/write multiple 72-bit words
to different subarray pairs at once.

    If each DIMM has 8 DRAM chips, and each chip had 8 subarray IO pins,
    this could perform a 64-way scatter-gather of 72-bit words in as little
    as 18 clocks with data words in separate subarrays. Or a worst case where
    all the physical addresses land in the same subarray, 64*18 clocks.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Fri Mar 6 14:06:15 2026
    From Newsgroup: comp.arch

    On 3/4/2026 2:15 PM, Scott Lurndal wrote:
    Michael S <[email protected]> writes:
    On Wed, 4 Mar 2026 07:44:39 -0800
    Stephen Fuld <[email protected]d> wrote:

    On 3/3/2026 1:55 PM, Scott Lurndal wrote:
Stephen Fuld <[email protected]d> writes:
On 3/3/2026 11:01 AM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

On 3/1/2026 1:13 PM, MitchAlsup wrote:

Stephen Fuld <[email protected]d> posted:

snip

cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.

Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's
disk controller, making it the industry's first true cache disk
controller. This was almost all about reducing latency (from
tens of milliseconds on a non-cache controller to hundreds of
microseconds on a cache hit). There was a small improvement in
transfer rate, but the latency reduction dominated the
improvement.

There is an SSD that can perform 3,300,000 × 4096B random read
transfers per second on a PCIe 5.0 ×4 connector. That is 13.2 GB/s
over the PCIe link, which is BW limited to 15.x GB/s. Each "RAS"
has a 70 µs access delay.

Wow! But I think you will agree that design is unlikely to be
used for "mass market" hundreds-of-terabytes systems used for
commercial database systems, etc.

I think you'll find that the commercial databases are dominated
by high-end NVMe (PCIe based SSDs) for working storage, with
spinning rust as archive storage.

For example, a 61TB NVMe PCIe gen5 SSD for USD 7,700.00.

https://techatlantix.com/mzwmo61thclf-00aw7.html

I freely admit that I am "out of the loop" for modern systems.
However, this system costs about $125/TB. A quick check of Amazon
shows typical hard disk prices at about $25/TB.

    That's not a fare comparison.

Indeed, it's not even a fair comparison :-)

    $25/TB you see on Amazon is likely 5400 rpm 4TB disk.
    So, at 61 TB you only have 15 spindles. It means that even in ideal
    conditions of accesses distributed evenly to all disks your sequential
    read bandwidth is no better than ~1,200 MB/s.
    For comparison, sequential read speed of BM1743 SSD is 7,500 MB/s.

In order to get comparable bandwidth with HDs you will need
95 spindles. Maybe a little less with 7200 rpm. I don't know how many
spindles you would need with the 13000 rpm HDs that enterprises used
to use for databases 20+ years ago. It seems they had better latency
than 7200, but about the same bandwidth. Anyway, I'm not even sure
that anybody still makes them.

95 disks alone almost certainly cost more than 8KUSD. And on top of
that you will need a dozen expensive RAID controllers.

    So, essentially, when you paid 8KUSD for this SSD, you paid it for
    bandwidth alone. 100x improvement in latency that you got over HD-based
    solution is a free bonus. Density and power too.

    Density and power are the most important criteria in modern
    datacenters. Reliability is also a consideration. While
    the backblaze drive reports show moderately reasonable results
    for most spinning rust, the NVME SSD is far more reliable and
    consumes far less power and rack space than the equivalent
    hard disk would.

Another advantage of NVMe cards is the availability of
PCI SR-IOV, which allows the NVMe card to be partitioned
    and made available to multiple independent guests without
    sacrificing security. The downside of host-based NVMe is
    the inability to share bandwidth with multiple hosts. That
    downside is eliminated with external NVMe based RAID
    subsystems connected via 400Gbe or FC.

    https://www.truenas.com/r-series/r60/

    7PB at 60GB/sec.

    Idle:
    Wonders if some of the approaches used in SSDs could be used to make higher-density DRAM.


    Say, rather than 1-bit per DRAM cell, they can hold multiple bits.
    Granted, this might require the RAM to implement its own refresh process
    to retain stability; and possibly some changes to the RAM protocols
    (events like selecting rows may need to be made to be able to handle
    variable latency).

    Say, for example, rather than changing a row and waiting a fixed RAS
    latency, it is changing a row and waiting for the RAM to signal back
    that the active row has been changed. Likely CAS could remain fixed
    latency though.

    Say, for example, if RAS moves a row into an internal SRAM cache, with
    CAS accessing this SRAM. Closing the row or changing rows then writing
    these back to the internal DRAM cells, with a per-row autonomous refresh
    built into the RAM.

    Like with SSDs, there could probably be invisible ECC in the RAM to deal
    with cases where the DRAM cells deteriorate between refresh cycles.

    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Mar 7 01:49:31 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do it.

    It comes with 2 consequences::
a) RAS time would at least double (30ns->60ns), but likely 4×
b) refresh rate would go up by a factor of 4× minimum.

    Say, rather than 1-bit per DRAM cell, they can hold multiple bits.
    Granted, this might require the RAM to implement its own refresh process
    to retain stability; and possibly some changes to the RAM protocols
    (events like selecting rows may need to be made to be able to handle variable latency).

    Say, for example, rather than changing a row and waiting a fixed RAS latency, it is changing a row and waiting for the RAM to signal back
    that the active row has been changed. Likely CAS could remain fixed
    latency though.

    Say, for example, if RAS moves a row into an internal SRAM cache, with
    CAS accessing this SRAM. Closing the row or changing rows then writing
    these back to the internal DRAM cells, with a per-row autonomous refresh built into the RAM.

    Like with SSDs, there could probably be invisible ECC in the RAM to deal with cases where the DRAM cells deteriorate between refresh cycles.

    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Mar 7 13:21:50 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:

    Cray-2 memory was big enough, but was it fast enough latency wise?
    All info I see about Cray-2 memory praises its great capacity and
    bandwidth, but tells nothing about latency.
It seems the latency was huge and the whole system was useful only due
to a mechanism that today we would call hardware prefetch. Or software
prefetch? I am not sure.
    But it is possible that I misunderstood.

    The Cray architecture had vector registers, which was an advantage
    over the memory-to-memory architectures like the Cyber 205.
    And it could run calculations in parallel with memory operations.

I guess you can call loading a vector register from memory a
"software prefetch" if you want, but it would be a bit of a
stretch of the terminology.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sat Mar 7 13:29:00 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:
    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:


    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated one.
    The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".


I know how to build an extremely good (but not crypto quality) and
reasonably fast PRNG completely in CPU registers (hint: all modern
cores can do a round of AES as a single reg-reg instruction). But such
a PRNG cannot have a standard API.

    You can use "darn" on POWER, of course :-)

    But why no standard API?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Mar 7 19:03:58 2026
    From Newsgroup: comp.arch

    On Sat, 7 Mar 2026 13:21:50 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    Cray-2 memory was big enough, but was it fast enough latency wise?
    All info I see about Cray-2 memory praises its great capacity and bandwidth, but tells nothing about latency.
It seems the latency was huge and the whole system was useful only
due to a mechanism that today we would call hardware prefetch. Or
software prefetch? I am not sure.
    But it is possible that I misunderstood.

    The Cray architecture had vector registers, which was an advantage
    over the memory-to-memory architectures like the Cyber 205.
    And it could run calculations in parallel with memory operations.

I guess you can call loading a vector register from memory a
"software prefetch" if you want, but it would be a bit of a
stretch of the terminology.

    I know how it works on Cray-1.
    I don't know how it works on Cray-2, except that I know that it is
    different.
Your response sounds like your knowledge of Cray-2 is not deeper than
mine.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Mar 7 19:19:28 2026
    From Newsgroup: comp.arch

    On Sat, 7 Mar 2026 13:29:00 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:


    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated
    one. The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".


    I know how to build extremely good (but not crypto quality) and
    reasonably fast PRNG completely in CPU registers (hint: all modern
    cores can do a round of AES as a single reg-reg instruction). But
    such PRNG can not have standard API.

    You can use "darn" on POWER, of course :-)

    But why no standard API?

    Standard PRNG APIs assume that PRNG state is stored in some type of
    object. In older, more primitive APIs, like C RTL rand/srand, there is
    one global object. In more modern APIs, user can have as many objects
    as he wishes, but they are still object, somewhere in memory. One API
    call delivers one random number and updates an object (or the object).
So, there is inevitably at least one memory read and one memory write
going on per each API call. In the scenario described by Kent Dickey,
accesses to the PRNG object will have very good temporal locality.
Which means that on CPUs with cache they will have very good hit rate.
Which means that CPUs with cache will do that part of the test much
faster than cacheless Cray processors.
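The contrast can be made concrete. A minimal sketch of the "non-standard" API shape: the state is passed and returned by value, so for a struct this small common calling conventions can keep it entirely in registers, whereas rand() must load and store a hidden object in memory on every call. The xorshift64 step and all the names here are illustrative stand-ins.

```c
#include <stdint.h>

/* PRNG state passed and returned by value: small enough that common
   ABIs keep it in registers, so the generator itself causes no
   per-call memory traffic. */
typedef struct { uint64_t state; } rng_t;

typedef struct { rng_t r; uint64_t value; } rng_result_t;

static rng_result_t xorshift64_next(rng_t r)
{
    /* classic Marsaglia xorshift64 triple (13, 7, 17) */
    r.state ^= r.state << 13;
    r.state ^= r.state >> 7;
    r.state ^= r.state << 17;
    return (rng_result_t){ r, r.state };
}
```

The caller threads the state through explicitly — `res = xorshift64_next(res.r);` — which is exactly the discipline that standard APIs like rand()/srand() do not allow.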


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sat Mar 7 19:07:39 2026
    From Newsgroup: comp.arch


    Michael S <[email protected]> posted:

    On Sat, 7 Mar 2026 13:29:00 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:


    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated
    one. The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".


    I know how to build extremely good (but not crypto quality) and reasonably fast PRNG completely in CPU registers (hint: all modern
    cores can do a round of AES as a single reg-reg instruction). But
    such PRNG can not have standard API.

    You can use "darn" on POWER, of course :-)

    But why no standard API?

    Standard PRNG APIs assume that PRNG state is stored in some type of
    object. In older, more primitive APIs, like C RTL rand/srand, there is
    one global object. In more modern APIs, user can have as many objects
    as he wishes, but they are still object, somewhere in memory. One API
    call delivers one random number and updates an object (or the object).

    Two questions::

a) It seems to me that a PRNG being called from several threads (in a
non-deterministic order) would be whiter than when each thread has its
own PRNG 'state'. What say ye?

b) Why not put this PRNG state in Thread-Local-Storage, if you want
each thread to have its own PRNG-state?

So, there is inevitably at least one memory read and one memory write
going on per each API call.

    I still don't see why no standard API !

In the scenario described by Kent Dickey, accesses
to the PRNG object will have very good temporal locality

    to a single thread

    . Which means that
    on CPUs with cache they will have very good hit rate. Which means that
    CPUs with cache will do that part of test much faster than cacheless
    Cray processors.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Mar 7 21:21:47 2026
    From Newsgroup: comp.arch

    On Sat, 7 Mar 2026 13:29:00 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Wed, 04 Mar 2026 23:46:46 GMT
    MitchAlsup <[email protected]d> wrote:


    There are simple PRNGs that create very 'white' RNG sequences.
    However, a generated RN is used to index a table of previously
    computed RNs, and then swap the accessed one with the generated
    one. The table goes a long way in 'whitening' the RNG.

    So, good PRNGs are not memory reference free on the data side. But,
    on the other hand, the table does not have to be "that big".


    I know how to build extremely good (but not crypto quality) and
    reasonably fast PRNG completely in CPU registers (hint: all modern
    cores can do a round of AES as a single reg-reg instruction). But
    such PRNG can not have standard API.

    You can use "darn" on POWER, of course :-)


    darn is intended for seeding a secure PRNG, not unlike Intel/AMD RDSEED.
    It is not suitable *to be* a PRNG, for at least two reasons.

    But why no standard API?


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From George Neuner@[email protected] to comp.arch on Sat Mar 7 16:03:26 2026
    From Newsgroup: comp.arch

    On Fri, 06 Mar 2026 11:34:11 -0800, Andy Valencia <[email protected]>
    wrote:

    Stefan Monnier <[email protected]> writes:
    ...
    What does this have to do with Github?

    It's the industry standard way to share code?

    Github simply has had the most publicity in recent memory. That in no
    way makes it a "standard". A /LOT/ of people abandoned it (and
    projects hosted on it) because of its association with Microsoft.

    Also not everyone likes git, and Github's svn "gateway" is all but
    useless. [Not that I care, but it's yet another reason not to like
    Github.]


    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sat Mar 7 15:03:26 2026
    From Newsgroup: comp.arch

    On 3/6/2026 7:49 PM, MitchAlsup wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to make
    higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do it.

    It comes with 2 consequences::
    a) RAS time would at least double (30ns->60ns), but likely go up 4×
    b) refresh rate would go up by a factor of 4× minimum.


    This is why I was thinking that for such RAM it may be necessary to tweak
    the DDR protocols slightly to be able to deal with variable RAS timings
    (and/or just have a much larger RAS and very aggressive RAM refresh).

    But, could potentially allow for bigger/cheaper RAM (say, what if
    someone could afford 512GB or 1TB of RAM for a computer?...). We can do
    it well enough for SSDs.



    Though, may make sense to make the RAM actually functionally more like
    Flash storage than traditional DRAM (capacitor driven), just likely
    using plain MOSFETs rather than FGMOS.


    But, I am left to realize I am not actually sure how things like TLC/QLC
    Flash manage to drive the voltage to a specific level needed for a given
    bit pattern. I would assume that the value is measured via the resistive voltage drop on the bit-line or similar, but even then I am not sure.


    If going this route, would need a way to drive the gate voltage to a
    specific level and then disconnect the gate when not being driven.

    For FGMOS, could use voltage differences to drive the writing processes,
    but for normal MOSFETs would need a different mechanism (possibly
    drive-up and drive down lines along with Vcc/Gnd/Select lines).

    So, say, to write to a cell, you connect the Vcc/Gnd lines for that row,
    and then assert the Up/Down signal for that column. Maybe the Read
    bit-line could also be monitored, adjusting the voltage up/down as
    needed until it reads correctly (if neither is driven, the gate is left floating).

    Could maybe be done with 3 transistors/cell, but maybe this is too
    many?... But, could have unlimited write cycles (more useful for RAM).


    Say, rather than 1-bit per DRAM cell, they can hold multiple bits.
    Granted, this might require the RAM to implement its own refresh process
    to retain stability; and possibly some changes to the RAM protocols
    (events like selecting rows may need to be made to be able to handle
    variable latency).

    Say, for example, rather than changing a row and waiting a fixed RAS
    latency, it is changing a row and waiting for the RAM to signal back
    that the active row has been changed. Likely CAS could remain fixed
    latency though.

    Say, for example, if RAS moves a row into an internal SRAM cache, with
    CAS accessing this SRAM. Closing the row or changing rows then writing
    these back to the internal DRAM cells, with a per-row autonomous refresh
    built into the RAM.

    Like with SSDs, there could probably be invisible ECC in the RAM to deal
    with cases where the DRAM cells deteriorate between refresh cycles.

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sat Mar 7 23:48:34 2026
    From Newsgroup: comp.arch

    On Wed, 4 Mar 2026 11:49:28 -0800
    Stephen Fuld <[email protected]d> wrote:

    On 3/4/2026 11:17 AM, Michael S wrote:
    On Wed, 4 Mar 2026 19:38:08 +0100
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Wed, 4 Mar 2026 19:03:54 +0100
    Terje Mathisen <[email protected]> wrote:

    Stefan Monnier wrote:
    Let me try again. Suppose you had a (totally silly) program
    with a 2 GB array, and you used a random number to generate an
    address within it, then added the value at that addressed byte
    to an accumulator. Repeat say 10,000 times. I would call
    this program latency bound, but I suspect Anton would call it
    bandwidth bound. If that is true, then that explains the
    original differences Anton and I had.
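    Stephen's silly-program can be sketched directly. The function below is a hypothetical rendering of that description (names and the RNG choice are mine); the point of the original is that with a 2 GB buffer nearly every access misses every cache level and the TLB:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Sum n randomly addressed bytes from buf. The accesses are
   independent of each other, so in principle they can all be in
   flight at once -- which is the crux of the latency-bound vs
   bandwidth-bound disagreement. */
uint64_t random_byte_sum(const uint8_t *buf, size_t size,
                         int n, unsigned seed) {
    uint64_t acc = 0;
    srand(seed);
    for (int i = 0; i < n; i++) {
        /* combine two rand() calls to cover a multi-GB index range */
        size_t idx = (((size_t)rand() << 16) ^ (size_t)rand()) % size;
        acc += buf[idx];
    }
    return acc;
}
```

    Unlike pointer-chasing, nothing here serializes one load behind the previous one, so an OoO core with enough outstanding misses can overlap them — which is why Anton would call it bandwidth (or parallelism) bound rather than latency bound.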

    I think in theory, this is not latency bound: assuming enough
    CPU and memory parallelism in the implementation, it can be
    arbitrarily fast. But in practice it will probably be
    significantly slower than if you were to do a sequential
    traversal.

    10K selected from 2G means an average distance of 200K, so you get
    effectively very close to zero cache hits, and even TLB misses
    might be very significant unless you've set up huge pages.


    Relatively horrible.
    At a human time scale it would still be very fast.

    Assuming TLB+$L3+$L2+$L1 misses on every access the actual
    runtime will be horrible!

    Indeed, in practice you may sometimes see the performance be
    correlated with your memory latency, but if so it's only because
    your hardware doesn't offer enough parallelism (e.g. not enough
    memory banks).

    AFAIK, when people say "latency-bound" they usually mean that
    adding parallelism and/or bandwidth to your memory hierarchy
    won't help speed it up (typically because of pointer-chasing).
    This is important, because it's a *lot* more difficult to reduce
    memory latency than it is to add bandwidth or parallelism.

    When the working set does not allow any cache re-use, then a
    classic Cray could perform much better than a modern OoO cpu.

    Terje


    When the working set does not allow any cache re-use, then it does not
    fit in a classic Cray's main memory.

    The 1985 Cray-2 allowed 2GB, so theoretically possible, with the
    OS+program fitting into the 73 MB gap between 2GiB and 2E9.

    Easily done on a later Cray-Y-MP.


    I had Cray-1 in mind.

    Cray-2 memory was big enough, but was it fast enough latency-wise?
    All the info I see about Cray-2 memory praises its great capacity and
    bandwidth, but says nothing about latency.
    It seems the latency was huge and the whole system was useful only
    due to a mechanism that today we would call hardware prefetch. Or
    software prefetch? I am not sure.
    But it is possible that I misunderstood.

    Well, Cray had that vector thing going for it. :-) And, as Mitch
    has repeatedly pointed out, the memory bandwidth to support it. And "reasonable" memory latency for the non-vector operations.


    Pay attention, I am talking about the Cray-2.
    Unlike the Cray-1, Cray X-MP and Cray Y-MP, the Cray-2 saw relatively
    little use outside of DoE and DoD research centers.
    Not a lot was published about it.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Mar 8 00:28:06 2026
    From Newsgroup: comp.arch

    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do it.


    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
    Then you connect your storage capacitor to the bit line and it pulls it
    slightly up or pushes it slightly down. Then an open-loop sense amplifier
    detects this slight change in the voltage of the bit line.

    I don't see how any of that will work to distinguish between 4 levels of
    voltage instead of 2 levels. Much less so to distinguish between 16
    levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand enough
    to know that a flash cell's charge is not discharged into the higher
    capacitance of the bit line on every read operation.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Andy Valencia@[email protected] to comp.arch on Sat Mar 7 15:53:58 2026
    From Newsgroup: comp.arch

    [email protected] (Kent Dickey) writes:
    Roughly:
    ...
    This is a modification of Knuth Algorithm M in section 3.2.2 of The Art
    of Computer Programming, Vol. 2, Seminumerical Algorithms, where there
    is a detailed discussion on it.

    It does help to make simple LFSR PRNGs much better, but XORSHIFT128P (or others like it) is better for 64-bit CPUs (see https://en.wikipedia.org/wiki/Xorshift).
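    For reference, the xorshift128+ generator Kent mentions is tiny. This is the published reference version (Sebastiano Vigna's), which keeps all state in two 64-bit words — exactly the "completely in registers" shape discussed earlier in the thread:

```c
#include <stdint.h>

/* xorshift128+ reference implementation (Vigna). State must be
   seeded to anything other than all zeros. Seed is illustrative. */
static uint64_t s[2] = {1, 2};

uint64_t xorshift128plus(void) {
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23;                          /* shift constant a = 23 */
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);  /* b = 17, c = 26 */
    return s[1] + y;
}
```

    The two state words fit in registers across the loop, so a compiler can keep the whole generator memory-free inside a hot loop even though the standard-API version spills the state to an object between calls.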

    Thank you!

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Mar 8 08:27:04 2026
    From Newsgroup: comp.arch

    Thomas Koenig <[email protected]> writes:
    I guess you can call loading a vector register from memory a
    "software prefetch" if you want, but it would be a bit of a
    stretch of the terminology.

    That's like saying that Humpty Dumpty stretched the terminology of
    "glory".

    Software prefetch instructions are architectural noops.
    Microarchitecturally, they may load the accessed memory into a cache.
    Likewise, hardware prefetchers load some memory into a cache.

    Loading a vector register from memory has an architectural effect.
    And given that the classic Crays don't have caches, prefetching
    instructions would always be noops.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Sun Mar 8 05:16:48 2026
    From Newsgroup: comp.arch

    Tonight’s tradeoff was having the memory page size determined by the
    root pointer. A few bits (5) in a root pointer could be used to set the
    page size. All references through that root pointer would then use the specified page size. When the root pointer changes, the page size goes
    along with it.

    I think one could get away with not flushing the TLB, given ASID
    matching on the entries. For a given ASID the page size would be
    consistent with the root pointer.

    Alternately the TLB entry could be tagged with the root pointer register number, so if a different root pointer register is used the entry would
    not match.

    I have been studying the 68851 MMU. Quite complex compared to some other
    MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sun Mar 8 12:13:24 2026
    From Newsgroup: comp.arch

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do it.


    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
    Then you connect your storage capacitor to the bit line and it pulls it
    slightly up or pushes it slightly down. Then an open-loop sense amplifier
    detects this slight change in the voltage of the bit line.

    I don't see how any of that will work to distinguish between 4 levels of
    voltage instead of 2 levels. Much less so to distinguish between 16
    levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand enough
    to know that flash cell's charge is not discharged into higher
    capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage levels.
    Your description of how DRAM works is mostly fine (AFAIUI - I am not a
    chip designer), except that I think the read sense amplifiers must have
    much lower input capacitance than the cells' storage capacitors.

    It would certainly be possible to put the write line at four different
    levels rather than just high or low - just like you can do with a flash
    cell. The trouble is that where the flash cell has extremely low
    leakage of its stored charge, DRAM cells use a small capacitor and have
    lots of leakage. As soon as you stop writing, the capacitor charge
    leaks back and forth to ground, to positive supply, to control lines, to neighbour cells, and so on. Reducing the voltage change from that
    leakage means using a bigger capacitor, which is slower to write.

    All this means that some time after you have written to a DRAM cell, the voltage on the cell capacitor is different from what you wrote. You
    have to read it, and re-write it, before the voltage changes enough to
    be misinterpreted. And the very act of reading changes the cell's
    charge too, depending on leakage to the nearby read lines, and the state
    of the input capacitance on the read lines. Some of all these leakages, changes and influences is relatively predictable from the layout of the
    chips - other parts vary significantly depending on temperature, the
    values in neighbouring cells, and the read/write patterns. (Remember "Rowhammer" ?)

    Storing 2 bits per cell makes all this hugely worse. And that means
    you need bigger storage capacitors to reduce the effect - making the
    cells bigger, slower and requiring more power. Your read and write
    circuitry becomes significantly more complex (and big, slow, and power hungry). And you have to have far shorter refresh cycles (you guessed
    it - it makes things bigger, slower and more power-hungry).

    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than twice
    the size as well as several times slower and demanding much more power. Clearly it would not be a good idea. But you /could/ do it.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Mar 8 13:15:34 2026
    From Newsgroup: comp.arch

    On Sun, 08 Mar 2026 08:27:04 GMT
    [email protected] (Anton Ertl) wrote:

    Thomas Koenig <[email protected]> writes:
    I guess you can call loading a vector register from memory a
    "software prefetch" if you want, but it would be a bit of a
    stretch of the terminology.

    That's like saying that Humpty Dumpty stretched the terminology of
    "glory".

    Software prefetch instructions are architectural noops.
    Microarchitecturally, they may load the accessed memory into a cache. Likewise, hardware prefetchers load some memory into a cache.

    Loading a vector register from memory has an architectural effect.
    And given that the classic Crays don't have caches, prefetching
    instructions would always be noops.

    - anton

    Cray-2 had no cache, same as 1, X-MP and Y-MP. Nevertheless, unlike
    1, and according to my understanding, unlike X-MP and Y-MP, memory
    hierarchy of Cray-2 was not flat.

    Here is a paragraph from Wikipedia:
    "To avoid this problem the new design banked memory and two sets of
    registers (the B- and T-registers) were replaced with a 16 KWord block
    of the very fastest memory possible called a Local Memory, not a cache, attaching the four background processors to it with separate high-speed
    pipes. This Local Memory was fed data by a dedicated foreground
    processor which was in turn attached to the main memory through a
    Gbit/s channel per CPU; X-MPs by contrast had three, for two
    simultaneous loads and a store and Y-MP/C-90s had five channels to
    avoid the von Neumann bottleneck. It was the foreground processor's
    task to "run" the computer, handling storage and making efficient use
    of the multiple channels into main memory. It drove the background
    processors by passing in the instructions they should run via eight
    16-word buffers, instead of tying up the existing cache pipes to the
    background processors. Modern CPUs use a variation of this design as
    well, although the foreground processor is now referred to as the
    load/store unit and is not a complete machine unto its own."


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Mar 8 13:37:41 2026
    From Newsgroup: comp.arch

    On Sun, 8 Mar 2026 12:13:24 +0100
    David Brown <[email protected]> wrote:

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do
    it.

    How exactly?
    With 1 bit per cell, you pre-charge a bit line to mid-level volatge.
    Then you connect your storage capacitor to bit line an it pulls it
    slightly up or pushes it slightly down. Then open-loop sense
    amplifier detects this slight change in voltage of bit line.

    I don't see how any of that will work to distinguish between 4
    levels of volatge instead of 2 levels. Much less so, to distinguish
    between 16 levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand
    enough to know that flash cell's charge is not discharged into
    higher capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage
    levels. Your description of how DRAM works is mostly fine (AFAIUI - I
    am not a chip designer), except that I think the read sense
    amplifiers must have much lower input capacitance than the cells
    storage capacitors.


    It's not the amplifier that has high capacitance, it's the bit line.
    And that is inevitable*.
    The rest of your post does not make a lot of sense, because
    you don't take it into account. Most of the things you wrote there are
    correct, but they make no practical difference, because of this high
    capacitance.

    * - inevitable for as long as you want to have 4-8K rows per bank.
    But there are plenty of good reasons why you do want to have many
    rows per bank, if not 4K then at the very least 1K. I can list some
    reasons, but (1) it would be off topic, (2) I am not a specialist, so
    there is a danger of me being wrong in the details.


    It would certainly be possible to put the write line at four
    different levels rather than just high or low - just like you can do
    with a flash cell. The trouble is that where the flash cell has
    extremely low leakage of its stored charge, DRAM cells use a small
    capacitor and have lots of leakage. As soon as you stop writing, the capacitor charge leaks back and forth to ground, to positive supply,
    to control lines, to neighbour cells, and so on. Reducing the
    voltage change from that leakage means using a bigger capacitor,
    which is slower to write.

    All this means that some time after you have written to a DRAM cell,
    the voltage on the cell capacitor is different from what you wrote.
    You have to read it, and re-write it, before the voltage changes
    enough to be misinterpreted. And the very act of reading changes the
    cell's charge too, depending on leakage to the nearby read lines, and
    the state of the input capacitance on the read lines. Some of all
    these leakages, changes and influences is relatively predictable from
    the layout of the chips - other parts vary significantly depending on temperature, the values in neighbouring cells, and the read/write
    patterns. (Remember "Rowhammer" ?)

    Storing 2 bits per cell makes all this hugely worse. And that means
    you need bigger storage capacitors to reduce the effect - making the
    cells bigger, slower and requiring more power. Your read and write circuitry becomes significantly more complex (and big, slow, and
    power hungry). And you have to have far shorter refresh cycles (you
    guessed it - it makes things bigger, slower and more power-hungry).

    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power. Clearly it would not be a good idea. But you /could/ do
    it.



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Sun Mar 8 12:36:03 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> writes:
    Cray-2 had no cache, same as 1, X-MP and Y-MP. Nevertheless, unlike
    1, and according to my understanding, unlike X-MP and Y-MP, memory
    hierarchy of Cray-2 was not flat.

    Here is a paragraph from Wikipedia:
    "To avoid this problem the new design banked memory and two sets of
    registers (the B- and T-registers) were replaced with a 16 KWord block
    of the very fastest memory possible called a Local Memory, not a cache,
    attaching the four background processors to it with separate high-speed
    pipes. This Local Memory was fed data by a dedicated foreground
    processor which was in turn attached to the main memory through a
    Gbit/s channel per CPU; X-MPs by contrast had three, for two
    simultaneous loads and a store and Y-MP/C-90s had five channels to
    avoid the von Neumann bottleneck. It was the foreground processor's
    task to "run" the computer, handling storage and making efficient use
    of the multiple channels into main memory.

    So the Cray-2 has an explicitly controlled memory hierarchy, like the
    SPUs of the Cell Broadband engine. The background processors
    perform architecturally visible work and are programmed explicitly
    (like software prefetching and unlike hardware prefetching, but that's
    all the commonality there is). My understanding is that writing back
    from local storage to main memory is also explicit.

    For dense-matrix multiplication, I expect that this architecture can
    be made to shine, for most other tasks, even in supercomputing, it's
    probably hard to achieve good utilization.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sun Mar 8 15:10:52 2026
    From Newsgroup: comp.arch

    On 08/03/2026 12:37, Michael S wrote:
    On Sun, 8 Mar 2026 12:13:24 +0100
    David Brown <[email protected]> wrote:

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do
    it.

    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
    Then you connect your storage capacitor to the bit line and it pulls it
    slightly up or pushes it slightly down. Then an open-loop sense
    amplifier detects this slight change in the voltage of the bit line.

    I don't see how any of that will work to distinguish between 4
    levels of voltage instead of 2 levels. Much less so to distinguish
    between 16 levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand
    enough to know that flash cell's charge is not discharged into
    higher capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage
    levels. Your description of how DRAM works is mostly fine (AFAIUI - I
    am not a chip designer), except that I think the read sense
    amplifiers must have much lower input capacitance than the cells
    storage capacitors.


    It's not the amplifier that has high capacitance, it's the bit line.
    And that is inevitable*.
    The rest of your post does not make a lot of sense, because
    you don't take it into account. Most of the things you wrote there are
    correct, but they make no practical difference, because of this high
    capacitance.

    * - inevitable for as long as you want to have 4-8K rows per bank.
    But there are plenty of good reasons why you do want to have many
    rows per bank, if not 4K then at the very least 1K. I can list some
    reasons, but (1) it would be off topic, (2) I am not a specialist, so
    there is a danger of me being wrong in the details.


    Ultimately, it doesn't make a big difference whether the capacitance here
    is in the sense amplifier or the lines feeding it - though I can fully
    appreciate that the majority of the capacitance is in the lines here.

    If the line capacitance here is so much higher than the cell capacitor's
    (and I'll happily bow to your knowledge here), then it means the voltage
    seen by the sense amplifier will be a fraction of the charge voltage on
    the cell capacitor. Fair enough - it just means the voltage threshold
    between a 1 and a 0 is that much lower. But the rest of the argument
    remains basically the same. If you want to hold more than one bit in
    the cell, you need to distinguish between four voltage levels instead of
    2 voltage levels. The relative differences in voltages are the same,
    the influences on cell capacitor leakage are the same. But as the
    absolute voltage levels into the sense amplifier are now much smaller,
    other noise sources (like thermal noise) are relatively speaking more important, which reduces your error margins even more. So again we are
    left with the multi-level DRAM cell being theoretically possible, but
    now the practicality is even worse - and you probably also need to
    reduce the number of bits per sense amplifier line to reduce the
    capacitance.
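    The margin argument above can be put in rough numbers with a charge-sharing back-of-envelope. All capacitance and voltage values below are purely illustrative assumptions, not datasheet figures:

```c
/* Charge-sharing swing on the bit line when a cell at v_cell is
   connected to a bit line precharged to v_pre: the shared node
   settles at the capacitance-weighted average, so the bit line
   moves by (v_cell - v_pre) * Ccell / (Ccell + Cbl). */
double bitline_swing(double v_cell, double v_pre,
                     double c_cell, double c_bl) {
    return (v_cell - v_pre) * c_cell / (c_cell + c_bl);
}

/* Example, with assumed values (25 fF cell, 100 fF bit line,
   Vdd = 1.1 V):

   1 bit/cell, levels 0 and Vdd, bit line precharged to Vdd/2:
     bitline_swing(1.1, 0.55, 25e-15, 100e-15)   -> ~0.110 V

   2 bits/cell, four levels spaced Vdd/3 apart -- the swing that
   separates adjacent levels:
     bitline_swing(1.1/3, 0.0, 25e-15, 100e-15)  -> ~0.073 V

   So the per-level signal shrinks by 3x before any leakage or
   thermal noise is added, and the sense thresholds sit only half
   that spacing from each level. */
```

    This is why the heavy bit-line capacitance matters so much to the multi-level case: it attenuates an already-divided level spacing before the sense amplifier ever sees it.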

    As far as I can very roughly estimate, I believe Mitch's comment that
    you could put 2 bits in a single DRAM cell, but I think the
    disadvantages would be more dramatic than he suggested. It does not
    surprise me that nobody makes multi-level DRAM cells (AFAIK).


    It would certainly be possible to put the write line at four
    different levels rather than just high or low - just like you can do
    with a flash cell. The trouble is that where the flash cell has
    extremely low leakage of its stored charge, DRAM cells use a small
    capacitor and have lots of leakage. As soon as you stop writing, the
    capacitor charge leaks back and forth to ground, to positive supply,
    to control lines, to neighbour cells, and so on. Reducing the
    voltage change from that leakage means using a bigger capacitor,
    which is slower to write.

    All this means that some time after you have written to a DRAM cell,
    the voltage on the cell capacitor is different from what you wrote.
    You have to read it, and re-write it, before the voltage changes
    enough to be misinterpreted. And the very act of reading changes the
    cell's charge too, depending on leakage to the nearby read lines, and
    the state of the input capacitance on the read lines. Some of all
    these leakages, changes and influences is relatively predictable from
    the layout of the chips - other parts vary significantly depending on
    temperature, the values in neighbouring cells, and the read/write
    patterns. (Remember "Rowhammer" ?)

    Storing 2 bits per cell makes all this hugely worse. And that means
    you need bigger storage capacitors to reduce the effect - making the
    cells bigger, slower and requiring more power. Your read and write
    circuitry becomes significantly more complex (and big, slow, and
    power hungry). And you have to have far shorter refresh cycles (you
    guessed it - it makes things bigger, slower and more power-hungry).

    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power. Clearly it would not be a good idea. But you /could/ do
    it.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Sun Mar 8 10:56:33 2026
    From Newsgroup: comp.arch

    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

    In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
    chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done. I can't find any reference for that
    claim, tho. Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC). DRAM doesn't really
    have that option.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Mar 8 18:30:24 2026
    From Newsgroup: comp.arch

    On Sun, 8 Mar 2026 15:10:52 +0100
    David Brown <[email protected]> wrote:

    On 08/03/2026 12:37, Michael S wrote:
    On Sun, 8 Mar 2026 12:13:24 +0100
    David Brown <[email protected]> wrote:

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do
    it.

    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level
    voltage. Then you connect your storage capacitor to the bit line and
    it pulls it slightly up or pushes it slightly down. Then an
    open-loop sense amplifier detects this slight change in the voltage
    of the bit line.

    I don't see how any of that will work to distinguish between 4
    levels of voltage instead of 2 levels. Much less so, to
    distinguish between 16 levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand
    enough to know that flash cell's charge is not discharged into
    higher capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage
    levels. Your description of how DRAM works is mostly fine (AFAIUI
    - I am not a chip designer), except that I think the read sense
    amplifiers must have much lower input capacitance than the cells
    storage capacitors.


    It's not the amplifier that has high capacitance; it's the bit line.
    And it is inevitable*.
    The rest of your post does not make a lot of sense, because
    you don't take it into account. Most of the things you wrote there
    are correct, but they make no practical difference, because of this
    high capacitance.

    * - inevitable for as long as you want to have 4-8K rows per bank.
    But there are plenty of good reasons why you do want to have
    many rows per bank, if not 4K then at the very least 1K. I can list
    some reasons, but (1) it would be off topic, (2) I am not a
    specialist, so there is a danger of me being wrong in details.


    Ultimately, it doesn't make a big difference if the capacitance here
    is in the sense amplifier, or the lines feeding it - though I can
    fully appreciate that the majority of the capacitance is in the lines
    here.

    If the line capacitance here is so much higher than the cell
    capacitor (and I'll happily bow to your knowledge here), then it
    means the voltage seen by the sense amplifier will be a fraction of
    the charge voltage on the cell capacitor. Fair enough - it just
    means the voltage threshold between a 1 and a 0 is that much lower.


    No, it does not mean that.
    In the 1-bit scenario your sense amplifier works in open loop - it
    amplifies the difference between the precharge voltage and the
    observed voltage on the bit line as strongly as it can. All it cares
    about is the sign of the difference. It does not care about the exact
    ratio between the capacitance of the bit line and the capacitance of
    the storage capacitor. It also does not care if the cell was freshly
    refreshed or is near the end of its refresh interval. It does not
    care about its own gain, as long as the gain is high enough.
    All that makes distinguishing between 2 levels not just a little
    simpler, but a whole lot simpler than distinguishing between multiple
    levels, even when "multiple" is just 4 or even 3.
    Once again, I am not a specialist, but I would imagine that the
    latter would require a totally different level of precision, both in
    the value of the storage capacitor and in the capacitance of the bit
    line.
    The amplifier itself would have to be much more complicated, too. It
    will likely need two stages - first a sample-and-hold and only then a
    set of comparators. All that will likely increase the power of a row
    access by a much bigger factor than 2x.
    But I don't think that the whole idea could proceed far enough to
    start to care about power. The above-mentioned requirements of
    geometrical precision will kill it much earlier.
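
    The scale of the problem can be sketched numerically. Assuming
    illustrative textbook-style values (a 25 fF storage capacitor, a
    250 fF bit line precharged to VDD/2 - these numbers are my own
    assumptions, not from any real part), charge sharing leaves the sense
    amplifier only tens of millivolts, and splitting the stored voltage
    into four levels cuts the separation between adjacent levels by 3x:

```python
# Back-of-envelope DRAM charge-sharing margins.
# All component values are illustrative, not from any real device.
C_CELL = 25e-15      # storage capacitor, farads
C_BITLINE = 250e-15  # bit-line capacitance, farads
VDD = 1.1            # array voltage, volts

def bitline_swing(v_cell, v_pre=VDD / 2):
    """Voltage change on the bit line after the cell is connected
    (simple charge sharing between the two capacitances)."""
    v_final = (C_CELL * v_cell + C_BITLINE * v_pre) / (C_CELL + C_BITLINE)
    return v_final - v_pre

# 1 bit/cell: stored levels are 0 and VDD; only the sign of the swing matters.
swing_1 = bitline_swing(VDD)

# 2 bits/cell: four stored levels spaced VDD/3 apart, so the separation
# between adjacent bit-line swings shrinks by the same factor of 3.
levels = [i * VDD / 3 for i in range(4)]
swings = [bitline_swing(v) for v in levels]
margin_2bit = min(b - a for a, b in zip(swings, swings[1:]))

print(f"1-bit swing: {swing_1 * 1000:.1f} mV")
print(f"2-bit adjacent-level separation: {margin_2bit * 1000:.1f} mV")
```

    With these numbers the 1-bit case gives a 50 mV swing of known sign,
    while the 2-bit case must resolve 33 mV steps in absolute terms -
    before cell-to-cell capacitance variation and leakage are counted.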

    But the rest of the argument remains basically the same. If you want
    to hold more than one bit in the cell, you need to distinguish
    between four voltage levels instead of 2 voltage levels. The
    relative differences in voltages are the same, the influences on cell capacitor leakage are the same. But as the absolute voltage levels
    into the sense amplifier are now much smaller, other noise sources
    (like thermal noise) are relatively speaking more important, which
    reduces your error margins even more. So again we are left with the multi-level DRAM cell being theoretically possible, but now the
    practicality is even worse - and you probably also need to reduce the
    number of bits per sense amplifier line to reduce the capacitance.

    As far as I can very roughly estimate, I believe Mitch's comment that
    you could put 2 bits in a single DRAM cell, but I think the
    disadvantages would be more dramatic than he suggested.

    That's my point.
    At the end, not only would you have worse power and worse speed, but
    worse density as well.

    It does not
    surprise me that nobody makes multi-level DRAM cells (AFAIK).


    On the other hand, it is possible that the second idea that makes NAND
    so dense could apply to DRAM with a better chance of achieving
    something positive. I mean, 3D QLC flash is so dense not only due to
    QLC, but also due to 3D.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Sun Mar 8 12:53:32 2026
    From Newsgroup: comp.arch

    Stefan Monnier wrote:
    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

    In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
    chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done. I can't find any reference for that
    claim, tho. Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC). DRAM doesn't really
    have that option.


    === Stefan

    A quick search for "multi-bit" "dram" finds a recent paper:

    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025
    https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf

    which demonstrates 3 bits per cell but in just 25 cells.

    "we proposed and experimentally demonstrated the novel dual-gate (DG) indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C)
    dynamic random-access memory (DRAM) for array-level multi-bit storage.
    ...
    the optimized transistors... enable long retention time (>1500 s)
    and ultra-fast writing speed (< 10 ns).
    ...
    non-overlap 3-bit storage operation among 25 cells is achieved
    ...
    Recent research efforts have mostly focused on developing capacitor-less two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural innovation primarily aims to save the large space occupied by storage
    capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
    non-destructive, which enables multi-bit storage in bit-cell"



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@[email protected] (John Dallman) to comp.arch on Sun Mar 8 17:53:00 2026
    From Newsgroup: comp.arch

    In article <[email protected]>, [email protected] (Anton Ertl) wrote:

    IA-64 certainly had significantly larger code sizes than others,
    but I think that they expected it and found it acceptable.

    The code size would have been better had they obtained the hoped-for ILP.
    Too much was riding on that.

    Also, bulky code is hard on memory bandwidth and cache size, and IA-64
    needed more of those than its competitors.

    And a software-pipelined loop on IA-64 is probably smaller than an auto-vectorized loop on AMD64+AVX

    The code I was working on at the time (and still do) didn't _have_ many
    loops that the compiler could software-pipeline, or that can be
    auto-vectorised now. It's pretty branchy and does a lot of
    pointer-chasing. Making it reliable and extending its functionality has
    always been a higher priority than redesigning its many algorithms for
    vectorisation. It is largely memory-bound.

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.

    I have never encountered any successes, and given how keen Intel were on
    their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any
    gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.

    Two points on that:

    Prefetches were a fundamental architectural feature of IA-64, and Intel professed to believe in their effectiveness. Further, they came into
    registers, rather than cache.

    The loading into registers was part of an architectural bug with
    prefetches of floating-point values. If you did a floating-point
    advance load into a callee-save register, then called a function which
    actually did save that register, the sequence of events could easily
    come out as:

    Advance floating-point load into Rn.
    ...
    Call function.
    Function pushes Rn.
    ...
    Advance load arrives, possibly messing up the function.
    ...
    Function pops Rn.
    Function returns.
    ...
    Check the floating-point load. The ALAT says it has happened, which is
    true. However, the value has been lost.

    The "fix" adopted was to re-issue all outstanding floating-point loads
    after each function call and return. That bulked out the code still more.
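
    The race above can be modelled in a few lines (a toy simulation of
    the event ordering, not of real IA-64 semantics): the ALAT records
    only that the load completed, not that the callee's save/restore of
    the same register happened around the arriving data.

```python
# Toy model of the IA-64 advanced-load bug described above.
# The advanced load's data arrives *between* the callee's push and
# pop of the target register, so the restore clobbers the loaded value
# while the ALAT still reports success.
def run(load_arrives_during_call):
    reg = {"Rn": "caller-value"}
    alat = {"Rn": False}            # "has the advanced load completed?"

    # 1. issue advanced FP load into Rn (data not yet arrived)
    # 2. call function: callee saves Rn
    saved = reg["Rn"]
    if load_arrives_during_call:
        reg["Rn"] = "loaded-value"  # 3. load arrives mid-call
        alat["Rn"] = True           #    ALAT marks it complete
    # 4. callee restores Rn and returns
    reg["Rn"] = saved
    # 5. chk.a: ALAT says the load happened, so Rn is used as-is
    return alat["Rn"], reg["Rn"]

ok, value = run(load_arrives_during_call=True)
print(ok, value)   # True caller-value  <- "succeeded", value lost
```

    Re-issuing outstanding loads after every call, as the "fix" did, is
    the only way to repair step 5 without hardware tracking the restore.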


    OoO helps in several ways: it will do some work in the shadow of the
    load (although the utilization will still be abysmal even with
    present-day schedulers and ROBs [1]); but more importantly, it can
    dispatch additional loads that may also miss the cache, resulting in
    more memory-level parallelism.

    Yup.

    They wanted to do it (and did it) in the compiler; the corresponding architectural feature is IIRC the advanced load.

    And it failed, comprehensively.

    Having so many registers may have made it harder than otherwise, but
    SPARC also used many registers.

    Not really on the same scale, surely?

    The issue is that speculative execution and OoO makes all the
    EPIC features of IA-64 unnecessary, so if they cannot do
    a fast in-order implementation of IA-64 (and they could not), they
    should just give up and switch to an architecture without these
    features, such as AMD64. And Intel did, after a few years of
    denying.

    Yes. They claimed at the time they would bring IA-64 back, when the fab technology was better. I don't think anyone believed them at the time,
    but an Intel marketing person I talked to a few years later was quite
    shocked at the idea that everyone knew this was nonsense, and they were
    just being humoured because arguing was pointless.

    In a world where we see convergence on fewer and fewer architecture
    styles and on fewer and fewer architectures, you only see the
    investment necessary for high-performance implementations of a new architecture if there is a very good reason not to use one of the
    established architectures (for ARM T32 and ARM A64 the smartphone
    market was that reason). It may be that politics will provide that
    reason for another architecture, but even then it's hard. But
    RISC-V seems to have the most mindshare among the alternatives,
    so if any architecture will catch up, it looks like the best bet.

    I was expecting the old tradition of "We have better computers, come and
    buy them" to have some effect, but it doesn't seem to be happening for
    RISC-V. There are at least two companies that were trying to design
    high-performance RISC-V cores: MIPS, who have been taken over and seem
    to be focused on other things now, and Ahead Computing, who haven't
    done much that is visible since they were formed.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sun Mar 8 19:39:40 2026
    From Newsgroup: comp.arch

    On 08/03/2026 17:30, Michael S wrote:
    On Sun, 8 Mar 2026 15:10:52 +0100
    David Brown <[email protected]> wrote:

    On 08/03/2026 12:37, Michael S wrote:
    On Sun, 8 Mar 2026 12:13:24 +0100
    David Brown <[email protected]> wrote:

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do
    it.

    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level
    voltage. Then you connect your storage capacitor to the bit line and
    it pulls it slightly up or pushes it slightly down. Then an
    open-loop sense amplifier detects this slight change in the voltage
    of the bit line.

    I don't see how any of that will work to distinguish between 4
    levels of voltage instead of 2 levels. Much less so, to
    distinguish between 16 levels, as in QLC flash.

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand
    enough to know that flash cell's charge is not discharged into
    higher capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage
    levels. Your description of how DRAM works is mostly fine (AFAIUI
    - I am not a chip designer), except that I think the read sense
    amplifiers must have much lower input capacitance than the cells
    storage capacitors.


    It's not the amplifier that has high capacitance; it's the bit line.
    And it is inevitable*.
    The rest of your post does not make a lot of sense, because
    you don't take it into account. Most of the things you wrote there
    are correct, but they make no practical difference, because of this
    high capacitance.

    * - inevitable for as long as you want to have 4-8K rows per bank.
    But there are plenty of good reasons why you do want to have
    many rows per bank, if not 4K then at the very least 1K. I can list
    some reasons, but (1) it would be off topic, (2) I am not a
    specialist, so there is a danger of me being wrong in details.


    Ultimately, it doesn't make a big difference if the capacitance here
    is in the sense amplifier, or the lines feeding it - though I can
    fully appreciate that the majority of the capacitance is in the lines
    here.

    If the line capacitance here is so much higher than the cell
    capacitor (and I'll happily bow to your knowledge here), then it
    means the voltage seen by the sense amplifier will be a fraction of
    the charge voltage on the cell capacitor. Fair enough - it just
    means the voltage threshold between a 1 and a 0 is that much lower.


    No, it does not mean that.
    In the 1-bit scenario your sense amplifier works in open loop - it
    amplifies the difference between the precharge voltage and the
    observed voltage on the bit line as strongly as it can. All it cares
    about is the sign of the difference. It does not care about the exact
    ratio between the capacitance of the bit line and the capacitance of
    the storage capacitor. It also does not care if the cell was freshly
    refreshed or is near the end of its refresh interval. It does not
    care about its own gain, as long as the gain is high enough.
    All that makes distinguishing between 2 levels not just a little
    simpler, but a whole lot simpler than distinguishing between multiple
    levels, even when "multiple" is just 4 or even 3.

    I didn't realise they used that kind of topology for the amplifier - and
    then yes, I agree that is going to be a good deal simpler. It also has
    the advantage that it is independent of differences in the capacitances
    in different cells, or tolerances in the capacitance of the lines and
    the amplifier itself. I have no idea how much variation there is
    between these, but it is undoubtedly good to eliminate it as a factor.

    Once again, I am not a specialist, but I would imagine that the
    latter would require a totally different level of precision, both in
    the value of the storage capacitor and in the capacitance of the bit
    line.

    I think so, yes. My own familiarity is with more discrete electronics,
    rather than with chips - the principles are the same, but the details
    are different, and the practical solutions can be quite different. You
    might not be a specialist here, but I have learned from your posts here.

    The amplifier itself would have to be much more complicated, too. It
    will likely need two stages - first a sample-and-hold and only then a
    set of comparators. All that will likely increase the power of a row
    access by a much bigger factor than 2x.

    Agreed.

    But I don't think that the whole idea could proceed far enough to
    start to care about power. The above-mentioned requirements of
    geometrical precision will kill it much earlier.

    But the rest of the argument remains basically the same. If you want
    to hold more than one bit in the cell, you need to distinguish
    between four voltage levels instead of 2 voltage levels. The
    relative differences in voltages are the same, the influences on cell
    capacitor leakage are the same. But as the absolute voltage levels
    into the sense amplifier are now much smaller, other noise sources
    (like thermal noise) are relatively speaking more important, which
    reduces your error margins even more. So again we are left with the
    multi-level DRAM cell being theoretically possible, but now the
    practicality is even worse - and you probably also need to reduce the
    number of bits per sense amplifier line to reduce the capacitance.

    As far as I can very roughly estimate, I believe Mitch's comment that
    you could put 2 bits in a single DRAM cell, but I think the
    disadvantages would be more dramatic than he suggested.

    That's my point.
    At the end, not only would you have worse power and worse speed, but
    worse density as well.


    Yes.

    It does not
    surprise me that nobody makes multi-level DRAM cells (AFAIK).


    On the other hand, it is possible that the second idea that makes NAND
    so dense could apply to DRAM with a better chance of achieving
    something positive. I mean, 3D QLC flash is so dense not only due to
    QLC, but also due to 3D.


    Scaling in the third direction would surely be good, yes. The
    challenge, I would think, would be heat dissipation. Flash does not
    need power to hold its data, so for the same speed of reading and
    writing you have basically the same power requirements regardless of the number of layers. With DRAM, power and heat would scale with the layers
    due to refreshes. You already see heat sinks for fast DRAM, so for a multi-layer device you'd need to put a lot more effort into cooling.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From David Brown@[email protected] to comp.arch on Sun Mar 8 19:43:24 2026
    From Newsgroup: comp.arch

    On 08/03/2026 17:53, EricP wrote:
    Stefan Monnier wrote:
    As I say, I am not a chip designer.  But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

         In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
         chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done.  I can't find any reference for that
    claim, tho.  Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC).  DRAM doesn't really
    have that option.


    === Stefan

    A quick search for "multi-bit" "dram" finds a recent paper:

    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025
    https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf

    which demonstrates 3 bits per cell but in just 25 cells.

    "we proposed and experimentally demonstrated the novel dual-gate (DG) indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C) dynamic random-access memory (DRAM) for array-level multi-bit storage.
    ...
    the optimized transistors... enable long retention time (>1500 s)
    and ultra-fast writing speed (< 10 ns).
    ...
    non-overlap 3-bit storage operation among 25 cells is achieved
    ...
    Recent research efforts have mostly focused on developing capacitor-less two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural innovation primarily aims to save the large space occupied by storage capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
    non-destructive, which enables multi-bit storage in bit-cell"


    If you eliminate the storage capacitor, does that not also eliminate the
    need for refresh? And then you have SRAM rather than DRAM? Or does the transistor pair still leak charge?


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@[email protected] to comp.arch on Sun Mar 8 18:59:52 2026
    From Newsgroup: comp.arch

    Michael S <[email protected]> schrieb:

    No, it does not mean that.
    In the 1-bit scenario your sense amplifier works in open loop - it
    amplifies the difference between the precharge voltage and the
    observed voltage on the bit line as strongly as it can. All it cares
    about is the sign of the difference. It does not care about the exact
    ratio between the capacitance of the bit line and the capacitance of
    the storage capacitor. It also does not care if the cell was freshly
    refreshed or is near the end of its refresh interval. It does not
    care about its own gain, as long as the gain is high enough.
    All that makes distinguishing between 2 levels not just a little
    simpler, but a whole lot simpler than distinguishing between multiple
    levels, even when "multiple" is just 4 or even 3.

    I want a balanced ternary computer, and I want it NOW!
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@[email protected] to comp.arch on Sun Mar 8 21:03:56 2026
    From Newsgroup: comp.arch

    On Sun, 8 Mar 2026 19:39:40 +0100
    David Brown <[email protected]> wrote:

    On 08/03/2026 17:30, Michael S wrote:
    On Sun, 8 Mar 2026 15:10:52 +0100

    On the other hand, it is possible that the second idea that makes
    NAND so dense could apply to DRAM with a better chance of achieving
    something positive. I mean, 3D QLC flash is so dense not only due
    to QLC, but also due to 3D.


    Scaling in the third direction would surely be good, yes. The
    challenge, I would think, would be heat dissipation. Flash does not
    need power to hold its data, so for the same speed of reading and
    writing you have basically the same power requirements regardless of
    the number of layers. With DRAM, power and heat would scale with the
    layers due to refreshes. You already see heat sinks for fast DRAM,
    so for a multi-layer device you'd need to put a lot more effort into
    cooling.


    In the short run, I expect that there is an economic obstacle.
    For NAND flash, density is a premium feature. There exists a big
    market in which, for a given [large] capacity, a manufacturer can
    charge more for a denser SSD, often even when it delivers lower
    bandwidth than a less dense alternative.

    For DRAM, since the start of the current stage of the AI boom, the
    main premium feature is bandwidth. I recently compared prices of 32GB
    DDR4 and DDR5 ECC DIMMs, and was shocked by the difference. Density,
    on the other hand, appears to be far less important.
    3D DRAM is not likely to promise higher bandwidth at a given
    capacity; more likely the opposite is true. So, in order to be
    accepted in the current economic climate, it would have to promise
    something else, most likely significantly lower cost per bit.
    But new tech rarely has lower cost than more established tech :(

    Those are considerations that make me skeptical about the promise of
    3D DRAM in the short term. Long term is something else. Long term,
    everything is possible.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 8 14:19:49 2026
    From Newsgroup: comp.arch

    On 3/8/2026 9:56 AM, Stefan Monnier wrote:
    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

    In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
    chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done. I can't find any reference for that
    claim, tho. Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC). DRAM doesn't really
    have that option.


    This is part of why my idea is that one could replace the
    capacitor/sense mechanism from DRAM with MOSFETs in a NAND or NOR configuration (but using normal MOSFETs rather than the FGMOS in Flash).

    In MOSFETs, if you can disconnect the gate, it functions as a capacitor
    that can hold its previous state.

    In FGMOS, they rely on tunneling to modify the gate voltage (with
    comparably high gate voltages), but this leads to break-down over time.

    One option could be to replace the floating gate with a pair of MOSFETs
    (for pull-up and pull-down), which can be used to modify the gate (when neither the pull-up nor pull-down line is asserted, the gate floats).
    Leakage would still cause it to lose its value over time, but would
    still be more stable than the capacitors in normal DRAM.

    If the Pull Up/Down signals run perpendicular to the pull up/down
    Vcc/Vdd signals (which toggle on/off when writing rows), then it is
    possible to write on a grid (though activating a row for writing any of
    the cells would tend to degrade bits already written to that row via
    leakage).


    Unlike FGMOS though, the 3 MOSFET configuration would effectively have
    an unlimited number of write cycles.


    As noted, the difficulty is in writing multiple bits to each cell,
    which would go into analog territory.



    === Stefan

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 8 14:34:35 2026
    From Newsgroup: comp.arch

    On 3/8/2026 1:59 PM, Thomas Koenig wrote:
    Michael S <[email protected]> schrieb:

    No, it does not mean that.
    In the 1-bit scenario your sense amplifier works in open loop - it
    amplifies the difference between the precharge voltage and the
    observed voltage on the bit line as strongly as it can. All it cares
    about is the sign of the difference. It does not care about the exact
    ratio between the capacitance of the bit line and the capacitance of
    the storage capacitor. It also does not care if the cell was freshly
    refreshed or is near the end of its refresh interval. It does not
    care about its own gain, as long as the gain is high enough.
    All that makes distinguishing between 2 levels not just a little
    simpler, but a whole lot simpler than distinguishing between multiple
    levels, even when "multiple" is just 4 or even 3.

    I want a balanced ternary computer, and I want it NOW!

    Ternary could make sense for memory and signaling while mostly keeping
    to binary for actual logic (so that it still looks like a binary
    computer as far as software is concerned).

    In this case, the bridges would mostly map between multiples of 3 bits
    and multiples of 2 trits, with each 2 trits being treated as an octal
    number.
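
    That bridge is easy to sketch (my own toy encoding, assuming
    unbalanced trits 0..2; a balanced {-1, 0, 1} mapping would work the
    same way with an offset): each 3-bit group is an octal digit 0..7,
    carried as 2 trits with one of the 9 states left unused.

```python
# Toy binary<->ternary bridge: each 3-bit group (an octal digit, 0..7)
# travels as 2 trits (9 states; trit pair (2, 2) = value 8 is unused).
def bits_to_trits(octal_digit):
    assert 0 <= octal_digit < 8
    return divmod(octal_digit, 3)   # (high trit, low trit)

def trits_to_bits(hi, lo):
    v = 3 * hi + lo
    assert v < 8, "state 8 (trit pair 2,2) is unused by the bridge"
    return v

# Round-trip a 9-bit "byte" as three trit pairs (six trits total).
def encode9(value):
    return [bits_to_trits((value >> s) & 0x7) for s in (6, 3, 0)]

def decode9(pairs):
    v = 0
    for hi, lo in pairs:
        v = (v << 3) | trits_to_bits(hi, lo)
    return v

assert decode9(encode9(0o725)) == 0o725
```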


    Though, such a thing would seem to favor 9-bit bytes rather than 8-bit
    bytes. The extra bits could be used for other purposes, like error
    detection, memory tagging, or ECC.

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Mar 8 20:54:39 2026
    From Newsgroup: comp.arch


    Robert Finch <[email protected]> posted:

    Tonight’s tradeoff was having the memory page size determined by the
    root pointer. A few bits (5) in a root pointer could be used to set the
    page size. All references through that root pointer would then use the specified page size. When the root pointer changes, the page size goes
    along with it.

    My 66000 uses a 3-bit level indicator in every page-table-pointer and PTE.
    The root pointer LVL determines the size of the VAS
    The PTP LVLs determine the size of pages on the level being accessed.
    The PTE LVL = 001.

    The number of address bits that come from the PTE and from the VA is
    determined by the LVL of the PTP that pointed to this translating PTE,
    that is, the previous PTP.

    I use 8KB pages, so the list goes 8KB, 8MB, 8GB, 8TB, 8PB, 8EB, really-big.
    And each page provides 1024 (freely mixed) entries.

    This scheme allows for level skipping at the top, in the middle, and at
    the bottom (super-pages). Unused levels are checked for canonicality.
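
    The size progression follows from the page geometry and can be
    checked in a few lines (illustrative arithmetic only, not any actual
    My 66000 data structure): with 8 KB pages holding 1024 entries, each
    level up multiplies the mapped reach by 1024.

```python
# Reach of each page-table level: 8 KB pages, 1024 eight-byte entries
# per table, so every level multiplies the mapped region by 1024
# (giving the "8KB, 8MB, 8GB, 8TB, 8PB, 8EB" list).
PAGE = 8 * 1024   # bytes mapped by a leaf (LVL=1) PTE
ENTRIES = 1024    # entries per 8 KB table

def level_reach(lvl):
    """Bytes mapped by one entry at a given level (LVL=1 is a leaf PTE)."""
    return PAGE * ENTRIES ** (lvl - 1)

units = ["KB", "MB", "GB", "TB", "PB", "EB"]
for lvl, unit in enumerate(units, start=1):
    assert level_reach(lvl) == 8 * 1024 ** lvl
    print(f"LVL {lvl}: 8 {unit}")
```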

    I think not flushing the TLB could be got away with, with ASID matching
    on the entries. For a given ASID the page size would be consistent with
    the root pointer.

    This is what ASIDs are for.

    Alternately the TLB entry could be tagged with the root pointer register number, so if a different root pointer register is used the entry would
    not match.

    ASIDs, tag everything with ASIDs, and provide an INVAL-ASID instruction.

    I have been studying the 68851 MMU. Quite complex compared to some
    other MMUs. I will likely have a 68851-compatible MMU for my 68000
    project though.

    Even Moto figured out -851 was too far over the top, use -030 MMU instead.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Mar 8 21:08:58 2026
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 08/03/2026 12:37, Michael S wrote:
    On Sun, 8 Mar 2026 12:13:24 +0100
    David Brown <[email protected]> wrote:

    On 07/03/2026 23:28, Michael S wrote:
    On Sat, 07 Mar 2026 01:49:31 GMT
    MitchAlsup <[email protected]d> wrote:

    BGB <[email protected]> posted:

    ----see how easy it is to snip useless material-------

    Idle:
    Wonders if some of the approaches used in SSDs could be used to
    make higher-density DRAM.

    As far as putting 2-bits in a single DRAM cell, yes you could do
    it.

    How exactly?
    With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
    Then you connect your storage capacitor to the bit line and it pulls it
    slightly up or pushes it slightly down. Then an open-loop sense
    amplifier detects this slight change in the voltage of the bit line.

    I don't see how any of that will work to distinguish between 4
    levels of voltage instead of 2 levels. Much less so, to distinguish
    between 16 levels, as in QLC flash.
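    The charge-sharing readout just described can be put into rough numbers
    (the capacitances below are illustrative assumptions, not figures for
    any particular process):

```python
# Hedged sketch: charge-sharing readout of a 1T1C DRAM cell. The final
# bit-line deviation from precharge is the cell's stored voltage scaled
# by the capacitive divider Ccell / (Ccell + Cbitline).
def bitline_swing(v_cell, v_precharge, c_cell, c_bitline):
    """Bit-line voltage minus precharge after charge sharing (volts)."""
    return (v_cell - v_precharge) * c_cell / (c_cell + c_bitline)

VDD = 2.5          # volts (the "2.5V Vdd age" mentioned downthread)
C_CELL = 20e-15    # 20 fF storage capacitor (assumed)
C_BL = 400e-15     # bit line ~20x the cell capacitance (assumed)

swing_1 = bitline_swing(VDD, VDD / 2, C_CELL, C_BL)   # stored '1'
swing_0 = bitline_swing(0.0, VDD / 2, C_CELL, C_BL)   # stored '0'
```

    With these assumed values the swing comes out near +/-60mV, which is
    why splitting it into four distinguishable levels looks so hard.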

    In flash memory it works very differently.
    Not that I understand how it works completely, but I understand
    enough to know that flash cell's charge is not discharged into
    higher capacitance of bit line on every read operation.


    In theory you could have your DRAM cell hold different voltage
    levels. Your description of how DRAM works is mostly fine (AFAIUI - I
    am not a chip designer), except that I think the read sense
    amplifiers must have much lower input capacitance than the cells
    storage capacitors.


    It's not the amplifier that has high capacitance, it's the bit line.
    And it is inevitable*.
    The rest of your post does not make a lot of sense, because
    you don't take it into account. Most of the things you wrote there are
    correct, but they make no practical difference, because of this high
    capacitance.

    * - inevitable for as long as you want to have 4-8K rows per bank.
    But there are plenty of good reasons why you do want to have many
    rows per bank, if not 4K then at the very least 1K. I can list some
    reasons, but (1) it would be off topic, (2) I am not a specialist, so
    there is a danger of me being wrong in details.


    Ultimately, it doesn't make a big difference if the capacitance here is
    in the sense amplifier, or the lines feeding it - though I can fully appreciate that the majority of the capacitance is in the lines here.

    If the line capacitance here is so much higher than the cell capacitance
    (and I'll happily bow to your knowledge here), then it means the voltage
    seen by the sense amplifier will be a fraction of the charge voltage on
    the cell capacitor. Fair enough - it just means the voltage threshold
    between a 1 and a 0 is that much lower. But the rest of the argument
    remains basically the same. If you want to hold more than one bit in
    the cell, you need to distinguish between four voltage levels instead
    of two.

    Back in the 2.5V Vdd age, a DRAM cell would move the precharged bit line
    less than 60mV. Sense amps just had to be well balanced to read this
    out rapidly. I suspect the modern sense amps use something close to 20mV
    today.

    The relative differences in voltages are the same,
    the influences on cell capacitor leakage are the same. But as the
    absolute voltage levels into the sense amplifier are now much smaller,
    other noise sources (like thermal noise) are relatively speaking more important, which reduces your error margins even more. So again we are
    left with the multi-level DRAM cell being theoretically possible, but
    now the practicality is even worse - and you probably also need to
    reduce the number of bits per sense amplifier line to reduce the capacitance.

    There is a product of resolution and speed which cannot be avoided.
    As resolution has to go up, speed has to go down. In this case;
    resolution is how many unique values can be sensed, and speed is
    the number of nanoseconds it takes.

    There are A2Ds with 22-bit resolution, but they operate at MHz speeds.
    The DRAM sense amp has a resolution of 2 states and a speed of GHz,
    with a single noise margin (30mV). A 4-state DRAM cell would need
    three 10mV noise margins, precharge resolution better than 10mV,
    and sense amp resolution less than 10mV--this is phono-amplifier
    levels of noise.
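    The margin arithmetic above can be made concrete (a hedged sketch
    following the post's 60mV full swing; the helper name is mine):

```python
# With N evenly spaced voltage states inside a fixed full swing, there
# are N-1 decision thresholds, and the worst-case noise margin is half
# the spacing between adjacent states.
def per_level_margin(total_swing_mV, n_states):
    """Distance (mV) from a stored level to the nearest threshold."""
    return total_swing_mV / (n_states - 1) / 2

# 2 states over a 60mV swing -> one 30mV margin;
# 4 states over the same swing -> three 10mV margins.
two_state = per_level_margin(60, 2)
four_state = per_level_margin(60, 4)
```

    This is the resolution/speed product in miniature: tripling the number
    of thresholds cuts each margin to a third of the binary case.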

    As far as I can very roughly estimate, I believe Mitch's comment that
    you could put 2 bits in a single DRAM cell, but I think the
    disadvantages would be more dramatic than he suggested. It does not surprise me that nobody makes multi-level DRAM cells (AFAIK).

    Quite right--possible--but not likely to be good on various tradeoffs.



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Mar 8 21:15:57 2026
    From Newsgroup: comp.arch


    [email protected] (John Dallman) posted:

    In article <[email protected]>, [email protected] (Anton Ertl) wrote:
    ---------------------

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.

    I have never encountered any successes, and given how keen Intel were on their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.

    Two points on that:


    While I have, personally, added prefetch SW instructions and HW prefetchers, these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Sun Mar 8 21:18:22 2026
    From Newsgroup: comp.arch


    David Brown <[email protected]> posted:

    On 08/03/2026 17:53, EricP wrote:
    Stefan Monnier wrote:
    As I say, I am not a chip designer.  But I think that while you could >>> make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

         In 1997, NEC demonstrated a dynamic random-access memory (DRAM) >>      chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done.  I can't find any reference for that
    claim, tho.  Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC).  DRAM doesn't really >> have that option.


    === Stefan

    A quick search for "multi-bit" "dram" finds a recent paper:

    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025 https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf

    which demonstrates 3 bits per cell but in just 25 cells.

    "we proposed and experimentally demonstrated the novel dual-gate (DG) indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C) dynamic random-access memory (DRAM) for array-level multi-bit storage.
    ...
    the optimized transistors... enable long retention time (>1500 s)
    and ultra-fast writing speed (< 10 ns).
    ...
    non-overlap 3-bit storage operation among 25 cells is achieved
    ...
    Recent research efforts have mostly focused on developing capacitor-less two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural innovation primarily aims to save the large space occupied by storage capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is non-destructive, which enables multi-bit storage in bit-cell"


    If you eliminate the storage capacitor, does that not also eliminate the need for refresh?

    If you eliminate the storage capacitor, you eliminate the ability to store charge. Storing charge is the ONLY thing a DRAM cell does.

    And then you have SRAM rather than DRAM? Or does the transistor pair still leak charge?

    Everything leaks; the capacitor is there so that the stored value can
    be retained "for at least a while".


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Sun Mar 8 17:18:44 2026
    From Newsgroup: comp.arch

    David Brown wrote:
    On 08/03/2026 17:53, EricP wrote:
    Stefan Monnier wrote:
    As I say, I am not a chip designer. But I think that while you could
    make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

    In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
    chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done. I can't find any reference for that
    claim, tho. Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC). DRAM doesn't really >>> have that option.


    === Stefan

    A quick search for "multi-bit" "dram" finds a recent paper:

    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025
    https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf

    which demonstrates 3 bits per cell but in just 25 cells.

    "we proposed and experimentally demonstrated the novel dual-gate (DG)
    indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C)
    dynamic random-access memory (DRAM) for array-level multi-bit storage.
    ...
    the optimized transistors... enable long retention time (>1500 s)
    and ultra-fast writing speed (< 10 ns).
    ...
    non-overlap 3-bit storage operation among 25 cells is achieved
    ...
    Recent research efforts have mostly focused on developing capacitor-less
    two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural
    innovation primarily aims to save the large space occupied by storage
    capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
    non-destructive, which enables multi-bit storage in bit-cell"


    If you eliminate the storage capacitor, does that not also eliminate the need for refresh? And then you have SRAM rather than DRAM? Or does the transistor pair still leak charge?

    The charge is stored on the floating gate of one of the two transistors.
    I gather that "oxide semiconductors", indium gallium zinc oxide being one,
    are very low leakage, which in this case gives the >1500s retention time.

    Also, the standard 1T1C cell has a destructive read, while
    the 2T0C IGZO cell has a non-destructive read.

    There seems to be a recent flurry of research into oxide semiconductors
    for DRAM, and IGZO in particular.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 8 16:37:29 2026
    From Newsgroup: comp.arch

    On 3/8/2026 3:54 PM, MitchAlsup wrote:

    Robert Finch <[email protected]> posted:

    Tonight’s tradeoff was having the memory page size determined by the
    root pointer. A few bits (5) in a root pointer could be used to set the
    page size. All references through that root pointer would then use the
    specified page size. When the root pointer changes, the page size goes
    along with it.

    My 66000 uses a 3-bit level indicator in every page-table-pointer and PTE. The root pointer LVL determines the size of the VAS
    The PTP LVLs determine the size of pages on the level being accessed.
    The PTE LVL = 001.

    The number of address bits that come from PTE and from VA are determined
    by the LVL of the PTP that pointed to this translating PTE That is the previous PTP.

    I use 8KB pages, so the list goes 8KB, 8MB, 8GB, 8TB, 8PB, 8EB, really-big. And each page provides 1024 (freely mixed) entries.

    This scheme allows for level skipping at the top, in the middle, and at
    the bottom (super-pages). Unused levels are checked for canonicality.


    My case, there is both the MMU control-register and page-table base:
    MMU control register specifies the minimum page size (system scale);
    Page table base specifies things per page-table (size, page table
    depth/type);
    ...

    Though, my page-table walker code isn't fully generic, so things like page size, page-table layout, etc., were mostly configured at compile time.



    I think not flushing the TLB could be got away with, with ASID matching
    on the entries. For a given ASID the page size would be consistent with
    the root pointer.

    This is what ASIDs are for.


    Yes.

    Theoretically, one could also have a different page size for each VAS,
    but global pages could not be shared between address spaces with a
    different root page size (at least not with a set-associative TLB design).


    Alternately the TLB entry could be tagged with the root pointer register
    number, so if a different root pointer register is used the entry would
    not match.

    ASIDs, tag everything with ASIDs, and provide an INVAL-ASID instruction.

    I have been studying the 68851 MMU. Quite complex compared to some other
    MMUs. I will likely have a 68851 compatible MMU for my 68000 project though. >>
    Even Moto figured out -851 was too far over the top, use -030 MMU instead.

    In my case, still getting along OK with fairly minimal MMU hardware.

    The more complex case is the handling of page-access rights and behavior.

    Originally, I defined a scheme which had used a combination of:
    Global access flags;
    User/Group RWX flags (like traditional Unix file permissions);
    ACL Checking (with an ACL Miss event).

    In retrospect, it is starting to seem like the former two should have
    been left out, with ACL Checking serving as the primary access-control mechanism for everything (and with more bits left to encode page-access semantics).

    Say, for example:
    User R/W/X
    Supervisor R/W/X
    NoCache

    Some hair could have been avoided by keeping user/supervisor R/W/X as
    separate flags and keeping NoCache solely for NoCache, rather than
    having the Supervisor+NoCache and RWX flags effectively
    pulling double duty and awkwardly encoding separate access
    modes for User and Supervisor.

    In this case, the existing R/W/X+NC flags could serve solely as an access-schema selector (with ASID and KRR also serving as part of the ACL-Miss handling algorithm).

    This would mean that the ACL Cache would also need to use the Schema as
    part of the key, so it would likely become another set-associative cache
    (or use a 6- or 8-entry fully-associative cache).


    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 8 16:43:19 2026
    From Newsgroup: comp.arch

    On 3/8/2026 4:15 PM, MitchAlsup wrote:

    [email protected] (John Dallman) posted:

    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:
    ---------------------

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.

    I have never encountered any successes, and given how keen Intel were on
    their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was
    disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any
    gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.

    Two points on that:


    While I have, personally, added prefetch SW instructions and HW prefetchers, these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.

    Yeah...

    This area seems to be a lost cause, as cache misses seem to be a rather inescapable cost IME.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Sun Mar 8 17:06:26 2026
    From Newsgroup: comp.arch

    On 3/8/2026 4:18 PM, MitchAlsup wrote:

    David Brown <[email protected]> posted:

    On 08/03/2026 17:53, EricP wrote:
    Stefan Monnier wrote:
    As I say, I am not a chip designer.  But I think that while you could >>>>> make DRAM cells with 2 bits per cell, the cells would be more than
    twice the size as well as several times slower and demanding much
    more power.

    According to https://en.wikipedia.org/wiki/Multi-level_cell:

         In 1997, NEC demonstrated a dynamic random-access memory (DRAM) >>>>      chip with quad-level cells, holding a capacity of 4 Gbit.

    So apparently it's been done.  I can't find any reference for that
    claim, tho.  Has anyone heard of it?

    Comparing the situation of DRAM vs flash storage, I notice that MLC
    cells in flash storage sometimes read the data multiple times to get
    a more precise reading of the voltage level (IIUC).  DRAM doesn't really >>>> have that option.


    === Stefan

    A quick search for "multi-bit" "dram" finds a recent paper:

    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025
    https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf

    which demonstrates 3 bits per cell but in just 25 cells.

    "we proposed and experimentally demonstrated the novel dual-gate (DG)
    indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C)
    dynamic random-access memory (DRAM) for array-level multi-bit storage.
    ...
    the optimized transistors... enable long retention time (>1500 s)
    and ultra-fast writing speed (< 10 ns).
    ...
    non-overlap 3-bit storage operation among 25 cells is achieved
    ...
    Recent research efforts have mostly focused on developing capacitor-less >>> two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural >>> innovation primarily aims to save the large space occupied by storage
    capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
    non-destructive, which enables multi-bit storage in bit-cell"


    If you eliminate the storage capacitor, does that not also eliminate the
    need for refresh?

    If you eliminate the storage capacitor, you eliminate the ability to store charge. Storing charge is the ONLY thing a DRAM cell does.


    But, can be different:
    Does it store a charge that is dumped onto the bit-line?
    (Typical DRAM)
    Does it store a charge in a MOSFET gate to modulate resistance?
    (Typical of Flash storage)


    And then you have SRAM rather than DRAM? Or does the
    transistor pair still leak charge?

    Everything leaks; the capacitor is there so that the stored value can
    be retained "for at least a while".

    Yeah.
    The only other real way to store a bit is something like a flip-flop,
    but an FF only has two stable states. Trying to store an analog value
    in an FF would likely result in it quickly decaying into one of the
    possible states.

    Though, seems like with some cleverness it could be possible to create a ternary flip-flop (by slightly mutilating the design of the MOSFETs, *).

    *: Say, rather than MOSFETs with a single source/drain and gate, there
    could be a configuration with multiple conjoined MOSFETs between the
    high and low levels (sort of like an H-Bridge) with cross-linking
    between low/high taps and the gates on the opposite transistors, with center-taps for the main outputs.

    By using some more taps and gates, it could be possible to have multiple stable voltage levels between each pair (with the gate converging
    towards the closest stable voltage level).

    Though, multi-level SRAMs could make sense.





    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@[email protected] to comp.arch on Mon Mar 9 03:36:38 2026
    From Newsgroup: comp.arch

    On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:

    I *am* skeptical that supporting page-crossing (or even block-crossing) accesses is important enough to justify a lot of complexity and extra hardware,

    I am not. If an architecture is defined so that programmers are not
    expected to know how big pages are on any given implementation, then locks
    have to just work, always. If this can be achieved in a simple fashion,
    that's great; but if one is implementing an instance of a previously
    defined architecture, for which a lot of software already exists - then
    don't break anything, whatever it takes.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Brett@[email protected] to comp.arch on Mon Mar 9 04:50:30 2026
    From Newsgroup: comp.arch

    Robert Finch <[email protected]> wrote:
    Tonight’s tradeoff was having the memory page size determined by the
    root pointer. A few bits (5) in a root pointer could be used to set the
    page size. All references through that root pointer would then use the specified page size. When the root pointer changes, the page size goes
    along with it.

    I think not flushing the TLB could be got away with, with ASID matching
    on the entries. For a given ASID the page size would be consistent with
    the root pointer.

    Alternately the TLB entry could be tagged with the root pointer register number, so if a different root pointer register is used the entry would
    not match.

    I have been studying the 68851 MMU. Quite complex compared to some other MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.


    The 68851 MMU was such a disaster that it never worked right, so Motorola
    only used a subset of the design inside the 68030. Or so I had heard.

    Spent hours going over the design myself, never having looked at a MMU
    before, thought they were insane.

    SUN hated the 68851 and never used it, going instead with the 68030's
    on-chip MMU in its legacy systems as Sparc came out.

    A write up on the differences might be interesting.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Robert Finch@[email protected] to comp.arch on Mon Mar 9 03:01:42 2026
    From Newsgroup: comp.arch

    On 2026-03-09 12:50 a.m., Brett wrote:
    Robert Finch <[email protected]> wrote:
    Tonight’s tradeoff was having the memory page size determined by the
    root pointer. A few bits (5) in a root pointer could be used to set the
    page size. All references through that root pointer would then use the
    specified page size. When the root pointer changes, the page size goes
    along with it.

    I think not flushing the TLB could be got away with, with ASID matching
    on the entries. For a given ASID the page size would be consistent with
    the root pointer.

    Alternately the TLB entry could be tagged with the root pointer register
    number, so if a different root pointer register is used the entry would
    not match.

    I have been studying the 68851 MMU. Quite complex compared to some other
    MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.


    The 68851 MMU was such a disaster that it never worked right, so Motorola
    only used a subset of the design inside the 68030. Or so I had heard.

    Spent hours going over the design myself, never having looked at a MMU before, thought they were insane.

    SUN hated the 68851 and never used it, until the 68030 legacy system as
    Sparc came out.

    A write up on the differences might be interesting.

    I thought it looked not too bad on paper.

    I am up to about 700 LOC writing an emulator for it now; mostly a
    brute-force approach though.

    Another Trade-off (different MMU): making the L1 TLB not respect the
    lock status of entries. L1 updates may overwrite locked entries.
    However, the L2 does respect the lock status. So, if a locked entry is
    overwritten in L1 and later requested, it will be reloaded from the L2.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@[email protected] (Anton Ertl) to comp.arch on Mon Mar 9 08:04:39 2026
    From Newsgroup: comp.arch

    David Brown <[email protected]> writes:
    On 08/03/2026 17:53, EricP wrote:
    IGZO 2T0C DRAM With VTH Compensation Technique
    for Multi-Bit Applications, 2025
    https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf
    ...
    Recent research efforts have mostly focused on developing capacitor-less
    two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural
    innovation primarily aims to save the large space occupied by storage
    capacitor in conventional one-transistor-one-capacitor (1T1C) design
    ...
    Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
    non-destructive, which enables multi-bit storage in bit-cell"


    If you eliminate the storage capacitor, does that not also eliminate the >need for refresh? And then you have SRAM rather than DRAM?

    In SRAM each flipflop refreshes itself all the time.

    My guess (I should read the paper, but speculating is more fun:-) at
    the 0C DRAM is that it does not discharge on reading, and thus does
    not need a refresh after reading, but it still leaks, and therefore
    needs a refresh now and then. My guess is that it charges the gate of
    a MOS transistor or maybe two, for CMOS action. If it's just one,
    there would be two transistors per bit cell: the one mentioned above
    for reading, and another one for writing. That might explain the 2T0C terminology. By charging the gate to different voltages, multiple
    bits can be stored. The reading part here would be like the reading
    part of flash memory.

    Another benefit would be that the current for charging the bit line
    comes from the source or drain of the transistor, so it may be
    possible to work with a smaller charge, i.e. the gate would be smaller
    than the capacitor in conventional DRAM. But if the charge is
    smaller, I would expect it to leak faster; OTOH, it's not necessary to
    retain enough charge to make a measurable voltage swing on the
    bitline, only enough that the reading transistor still works (and that
    all the voltage levels are still discernible), and maybe that decay
    takes longer (but I find that hard to believe in the case of more than
    two voltage levels).

    Or does the
    transistor pair still leak charge?

    Even flash, where the gate is not connected to anything, leaks and
    needs refreshing (on the order of months or years, depending on
    temperature and degradation of the flash device). In the device I
    have in mind there will be this leakage, plus leakage through the
    write transistor.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@[email protected] to comp.arch on Mon Mar 9 11:05:56 2026
    From Newsgroup: comp.arch

    quadi [2026-03-09 03:36:38] wrote:
    On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
    I *am* skeptical that supporting page-crossing (or even block- crossing)
    accesses is important enough to justify a lot of complexity and extra
    hardware,
    I am not. If an architecture is defined so that programmers are not
    expected to know how big pages are on any given implementation, then locks have to just work always.

    I thought "natural alignment" was the standard solution to this problem.
    It avoids unaligned accesses at all levels, regardless of the sizes of
    pages or cache lines (as long as they stick to powers of 2), at the cost
    of adding padding, but that is usually considered negligible.
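    The claim checks out mechanically: a naturally aligned access of size
    2^k can never straddle any larger power-of-two boundary. A small
    exhaustive sketch (helper name is mine):

```python
# An access [addr, addr+size) crosses a boundary iff its first and last
# bytes fall in different boundary-sized regions.
def crosses_boundary(addr, size, boundary):
    return addr // boundary != (addr + size - 1) // boundary

# Exhaustive check over a small range: naturally aligned accesses of
# sizes 1..8 never cross 16-, 64-, or 128-byte boundaries.
for k in range(4):                       # sizes 1, 2, 4, 8
    size = 1 << k
    for addr in range(0, 256, size):     # naturally aligned addresses
        for b in (16, 64, 128):          # stand-ins for line/page sizes
            assert not crosses_boundary(addr, size, b)
```

    An unaligned counterexample: a 4-byte access at address 62 does cross
    a 64-byte boundary, which is exactly the case natural alignment rules out.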

    Unaligned accesses can still be important enough for some particular
    circumstances, but usually the programmer knows about it and AFAIK they
    never need to be atomic.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Mon Mar 9 13:14:11 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    [email protected] (John Dallman) posted:

    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:
    ---------------------

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.
    I have never encountered any successes, and given how keen Intel were on
    their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was
    disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any
    gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.
    Two points on that:


    While I have, personally, added prefetch SW instructions and HW prefetchers, these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.

    One area I think might be a benefit is to prefetch VA translations
    for instructions and data. These can be prefetched just into cache,
    or into I- and D- TLB's.

    I had the idea in 2010 while looking at locking and hardware transactions.
    If a memory section is guarded by a mutex, I don't want to prefetch
    the data as that could yank ownership away from the current mutex holder.

    What I might do is prefetch the translation PTE's for the data locations
    so that when I am granted mutex ownership that I minimize the time
    it is held by not waiting for cold memory table walks.

    I also optionally might like to be able to trigger advance page faults
    on data but without actually touching the data page such that it moves
    cache line ownership. This could save me from taking a page fault on a
    shared memory section while holding the guard mutex.

    Prefetching the VA translations for instructions could be a good
    tradeoff for alternate paths: load just the PTE's into cache,
    as opposed to loading the alternate code into cache.

    The VA translate prefetch instructions would need options to control
    which cache and I- and D- TLB level the PTE's are prefetched into.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Mon Mar 9 19:30:28 2026
    From Newsgroup: comp.arch


    EricP <[email protected]> posted:

    MitchAlsup wrote:
    [email protected] (John Dallman) posted:

    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:
    ---------------------

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.
    I have never encountered any successes, and given how keen Intel were on
    their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was
    disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any
    gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.
    Two points on that:


    While I have, personally, added prefetch SW instructions and HW prefetchers,
    these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.

    One area I think might be a benefit is to prefetch VA translations
    for instructions and data. These can be prefetched just into cache,
    or into I- and D- TLB's.

    I had the idea in 2010 while looking at locking and hardware transactions.
    If a memory section is guarded by a mutex, I don't want to prefetch
    the data as that could yank ownership away from the current mutex holder.

    Then you need a LD instruction that can fail and the status tested by
    some other instruction. That is: code performs a LD; LD takes a miss
    and leaves the CPU. Access finds cache line in modified or exclusive
    state, and instead of returning the value and making line stale, it
    fails. {with whatever definition you want for fail}. Since MOESI uses
    3-bits, you can use an unused MOESI state to record that a failed access
    has transpired--then use this to optimize downstream-cache behavior.

    What I might do is prefetch the translation PTE's for the data locations
    so that when I am granted mutex ownership that I minimize the time
    it is held by not waiting for cold memory table walks.

    This is what table-walk accelerators are for.

    I also optionally might like to be able to trigger advance page faults
    on data but without actually touching the data page such that it moves
    cache line ownership. This could save me from taking a page fault on a
    shared memory section while holding the guard mutex.

    Prefetching the VA translates for instructions could be a good
    tradeoff for alternate paths and just load the PTE's into cache,
    as opposed to loading the alternate code into cache.

    The VA translate prefetch instructions would need options to control
    which cache and I- and D- TLB level the PTE's are prefetched into.

    I use 5-bits for this (although in practice 3 would have been sufficient)
    {PRE, PUSH}×{{RWX}+{Cc}}
    where Cc tells which cache layer the data is fetched up into or pushed
    back down into.
    PUSH {{010}+{--}} is simple invalidate and throw away modifications.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Terje Mathisen@[email protected] to comp.arch on Mon Mar 9 22:42:06 2026
    From Newsgroup: comp.arch

    Kent Dickey wrote:
    There is one very reasonable use case: testing a random number generator.
    A useful test is to ensure numbers are uncorrelated, so you get 3 random numbers called A, B, C, and you look up A*N*N + B*N + C to count the number of times you see A followed by B followed by C, where N is the range of
    the random value, say, 0 - 1023. This would be an array of 1 billion 32-bit

    I would be quite happy with half the size, i.e. 1e9 u16 entries.

    values. You get 1000 billion random numbers, and then look through to make sure most buckets have a value around 1000. Any buckets less than 500 or more than 1500 might be considered a random number generator failure.
    This is a useful test since it intuitively makes sense--if some patterns are too likely (or unlikely), then you know you have a problem with your
    "random" numbers.

    I haven't done the math, but I would guess getting any deviation outside
    the 800-1200 range would be quite unlikely, and at least suspicious!


    Another use case would be an algorithm which wants to shuffle a large
    array (say, you want to create test cases for a sorting algorithm). I
    think most shuffling algorithms which are fair will randomly index into
    the array, and each of these will be a cache miss.

    You could do random writes instead of random reads, that turns it into a
    "how many simultaneous write buffers do we have" problem.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From John Levine@[email protected] to comp.arch on Tue Mar 10 06:07:19 2026
    From Newsgroup: comp.arch

    According to Stefan Monnier <[email protected]>:
    I *am* skeptical that supporting page-crossing (or even block-crossing)
    accesses is important enough to justify a lot of complexity and extra
    hardware, ...
    I thought "natural alignment" is the standard solution to this problem.
    Avoids unaligned accesses at all levels, regardless of the sizes of
    pages or cache lines (as long as one sticks to powers of 2) at the cost of
    adding padding, but that is usually considered as negligible.

    That was the theory on S/360 in 1964. By 1968 they added the "byte oriented" feature to 360/85 to allow unaligned access. Many RISC chips went the same way, starting with mandatory alignment, adding unaligned access later. What's different now?

    Unaligned accesses can still be important enough for some particular
    circumstances, but usually the programmer knows about it and AFAIK they
    never need to be atomic.

    I think that's right, atomic access is a fairly special case.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@[email protected] to comp.arch on Tue Mar 10 13:04:43 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <[email protected]> posted:

    MitchAlsup wrote:
    [email protected] (John Dallman) posted:

    In article <[email protected]>,
    [email protected] (Anton Ertl) wrote:
    ---------------------

    Actually, other architectures also added prefetching instructions
    for dealing with that problem. All I have read about that was that
    there were many disappointments when using these instructions.
    I don't know if there were any successes, and how frequent they
    were compared to the disappointments.
    I have never encountered any successes, and given how keen Intel were on
    their x86 version of this, and my employers' relationship with them at
    the time, I would expect to have heard about them. My own experience was
    disappointing, with minor speedups and slowdowns. My best hypothesis was
    that the larger code size worsened cache effects enough to cancel out any
    gains from the prefetches.

    So I don't see that IA-64 was any different from other architectures
    in that respect.
    Two points on that:

    While I have, personally, added prefetch SW instructions and HW prefetchers,
    these tend to add performance rather sporadically, and seldom add "enough"
    performance to justify taking up 'that much' of ISA or designer time.
    One area I think might be a benefit is to prefetch VA translations
    for instructions and data. These can be prefetched just into cache,
    or into I- and D- TLB's.

    I had the idea in 2010 while looking at locking and hardware transactions.
    If a memory section is guarded by a mutex, I don't want to prefetch
    the data as that could yank ownership away from the current mutex holder.

    Then you need a LD instruction that can fail and the status tested by
    some other instruction. That is: code performs a LD; LD takes a miss
    and leaves the CPU. Access finds cache line in modified or exclusive
    state, and instead of returning the value and making line stale, it
    fails. {with whatever definition you want for fail}. Since MOESI uses
    3-bits, you can use an unused MOESI state to record that a failed access
    has transpired--then use this to optimize downstream-cache behavior.

    Interesting - a load conditional on the cache line being either
    - cached locally in an MOES state
    - cached remotely in an S state
    - uncached

    Does your ESM use this approach?

    What I might do is prefetch the translation PTE's for the data locations
    so that when I am granted mutex ownership that I minimize the time
    it is held by not waiting for cold memory table walks.

    This is what table-walk accelerators are for.

    If by table walk accelerator you mean caching the interior level PTE's
    on the downward walk, and if there is a PTE miss checking them in a
    bottom-up table walk, then that mechanism is still there and
    my PTE prefetch would make use of it.

    I also optionally might like to be able to trigger advance page faults
    on data but without actually touching the data page such that it moves
    cache line ownership. This could save me from taking a page fault on a
    shared memory section while holding the guard mutex.

    Prefetching the VA translates for instructions could be a good
    tradeoff for alternate paths and just load the PTE's into cache,
    as opposed to loading the alternate code into cache.

    The VA translate prefetch instructions would need options to control
    which cache and I- and D- TLB level the PTE's are prefetched into.

    I use 5-bits for this (although in practice 3 would have been sufficient)
    {PRE, PUSH}×{{RWX}+{Cc}}
    where Cc tells which cache layer the data is fetched up into or pushed
    back down into.
    PUSH {{010}+{--}} is simple invalidate and throw away modifications.

    I would also have 2 or 3 cache control bits on all levels of PTE's
    but I would have separate lookup tables for interior and leaf PTE's.
    The tables map the cache control bits to the kind of caching used
    for that table level.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Tue Mar 10 18:28:07 2026
    From Newsgroup: comp.arch


    EricP <[email protected]> posted: ------------------------------
    I had the idea in 2010 while looking at locking and hardware transactions.
    If a memory section is guarded by a mutex, I don't want to prefetch
    the data as that could yank ownership away from the current mutex holder.

    Then you need a LD instruction that can fail and the status tested by
    some other instruction. That is: code performs a LD; LD takes a miss
    and leaves the CPU. Access finds cache line in modified or exclusive
    state, and instead of returning the value and making line stale, it
    fails. {with whatever definition you want for fail}. Since MOESI uses 3-bits, you can use an unused MOESI state to record that a failed access has transpired--then use this to optimize downstream-cache behavior.

    Interesting - a load conditional on the cache line being either
    - cached locally in an MOES state
    - cached remotely in an S state
    - uncached

    Does your ESM use this approach?

    ESM solves the case where one CAN have more than 1 cache line in an
    ATOMIC state {I've got it and you can't get at it}; which has nothing
    to do with MOESI.

    What I might do is prefetch the translation PTE's for the data locations
    so that when I am granted mutex ownership that I minimize the time
    it is held by not waiting for cold memory table walks.

    This is what table-walk accelerators are for.

    If by table walk accelerator you mean caching the interior level PTE's
    on the downward walk, and if there is a PTE miss checking them in a
    bottom-up table walk, then that mechanism is still there and
    my PTE prefetch would make use of it.

    TWA allows for any and all of that--whatever stores fit the needs.

    The Ross HyperSPARCs had a TWA consisting of comparators spanning VA
    fields, so the hit also conveyed what level (down from top).

    I also optionally might like to be able to trigger advance page faults
    on data but without actually touching the data page such that it moves
    cache line ownership. This could save me from taking a page fault on a
    shared memory section while holding the guard mutex.

    Prefetching the VA translates for instructions could be a good
    tradeoff for alternate paths and just load the PTE's into cache,
    as opposed to loading the alternate code into cache.

    The VA translate prefetch instructions would need options to control
    which cache and I- and D- TLB level the PTE's are prefetched into.

    I use 5-bits for this (although in practice 3 would have been sufficient)
    {PRE, PUSH}×{{RWX}+{Cc}}
    where Cc tells which cache layer the data is fetched up into or pushed
    back down into.
    PUSH {{010}+{--}} is simple invalidate and throw away modifications.

    I would also have 2 or 3 cache control bits on all levels of PTE's
    but I would have separate lookup tables for interior and leaf PTE's.
    The tables map the cache control bits to the kind of caching used
    for that table level.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@[email protected] to comp.arch on Tue Mar 10 16:41:30 2026
    From Newsgroup: comp.arch

    On 3/9/2026 10:05 AM, Stefan Monnier wrote:
    quadi [2026-03-09 03:36:38] wrote:
    On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
    I *am* skeptical that supporting page-crossing (or even block-crossing)
    accesses is important enough to justify a lot of complexity and extra
    hardware,
    I am not. If an architecture is defined so that programmers are not
    expected to know how big pages are on any given implementation, then locks
    have to just work always.

    I thought "natural alignment" is the standard solution to this problem. Avoids unaligned accesses at all levels, regardless of the sizes of
    pages or cache lines (as long as one sticks to powers of 2) at the cost of
    adding padding, but that is usually considered as negligible.

    Unaligned accesses can still be important enough for some particular
    circumstances, but usually the programmer knows about it and AFAIK they
    never need to be atomic.


    Yes, it makes sense and is reasonable to assume that misaligned access
    can never be assumed to be atomic even on an architecture natively
    supporting both misaligned access and atomic memory accesses.

    Well, and by extension, that an aligned access may never cross a page
    boundary (because it is not possible for it to do so).


    Things would break down if one gets into "non-power-of-2" territory, but
    most anyone sane doesn't go there (and any NPOT data usually exists as
    part of some sort of bit-packed format within a power-of-2 container).


    For most other cases where misaligned access is good/effective, such as
    data compression/string/memory-copy tasks, entropy coded bitstreams,
    etc, then it does not make sense to assume that these are atomic.

    Though it does usually make sense to assume that the source data is
    read-only, this assumption breaks in some use cases, such as
    self-overlapping copies within an LZ decompressor. In that case, the
    memory subsystem must enforce read/write ordering to ensure that
    overlapping reads and writes return the correct data.

    Though, this isn't usually much different from what is already needed to
    make sure that a cache line remains coherent, except for one special case:
    RAW could still be allowed to proceed without penalty if non-overlap can
    be verified. By itself, though, this would not address the statistically
    more common cache-line WAW penalty.

    The latter WAW penalty, at least absent special-case forwarding (which
    adds cost), requires tasks like prolog/epilog sequences and memory copy
    to use more convoluted access patterns to avoid stepping on "in flight"
    cache lines; naive linear-forward stores typically run into the penalty.


    Though, I suspect this issue may not apply to typical x86-64 machines,
    as I am not aware of seeing any obvious performance difference based on
    the relative order of memory stores.

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@[email protected] to comp.arch on Wed Mar 11 00:18:58 2026
    From Newsgroup: comp.arch


    BGB <[email protected]> posted:

    On 3/9/2026 10:05 AM, Stefan Monnier wrote:
    quadi [2026-03-09 03:36:38] wrote:
    On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
    I *am* skeptical that supporting page-crossing (or even block-crossing)
    accesses is important enough to justify a lot of complexity and extra
    hardware,
    I am not. If an architecture is defined so that programmers are not
    expected to know how big pages are on any given implementation, then locks
    have to just work always.

    I thought "natural alignment" is the standard solution to this problem.
    Avoids unaligned accesses at all levels, regardless of the sizes of
    pages or cache lines (as long as one sticks to powers of 2) at the cost
    of adding padding, but that is usually considered as negligible.

    Unaligned accesses can still be important enough for some particular
    circumstances, but usually the programmer knows about it and AFAIK they
    never need to be atomic.


    Yes, it makes sense and is reasonable to assume that misaligned access
    can never be assumed to be atomic even on an architecture natively supporting both misaligned access and atomic memory accesses.

    Many cache coherence protocols cannot guarantee that a misaligned access
    can appear ATOMIC.

    Well, and by extension, that an aligned access may never cross a page boundary (because it is not possible for it to do so).

    One cannot, in general, allow misaligned access and disallow page-crossing misaligned access.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Chris M. Thomasson@[email protected] to comp.arch on Tue Mar 10 17:20:38 2026
    From Newsgroup: comp.arch

    On 3/9/2026 11:07 PM, John Levine wrote:
    According to Stefan Monnier <[email protected]>:
    I *am* skeptical that supporting page-crossing (or even block-crossing)
    accesses is important enough to justify a lot of complexity and extra
    hardware, ...
    I thought "natural alignment" is the standard solution to this problem.
    Avoids unaligned accesses at all levels, regardless of the sizes of
    pages or cache lines (as long as stick to powers of 2) at the cost of
    adding padding, but that is usually considered as negligible.

    That was the theory on S/360 in 1964. By 1968 they added the "byte oriented" feature to 360/85 to allow unaligned access. Many RISC chips went the same way, starting with mandatory alignment, adding unaligned access later. What's
    different now?

    Unaligned accesses can still be important enough for some particular
    circumstances, but usually the programmers knows about it and AFAIK they
    never need to be atomic.

    I think that's right, atomic access is a fairly special case.


    Yeah. See what happens on an x86/x64 with a LOCK RMW on a word that
    straddles a cache line... ;^)
    --- Synchronet 3.21d-Linux NewsLink 1.2