Paul Clayton <[email protected]> writes:
An even more common example (numbering in the 100M to 1B range?) is x86 processors with interruptible REP MOVS/STOS/LODS instructions.
On 11/13/25 5:13 PM, MitchAlsup wrote:
[snip]
[email protected] (Anton Ertl) posted:
What I wanted to write was "And assembly language is
architecture-specific".
I have worked on a single machine with several different ASM "compilers".
Believe me, one asm can be different than another asm.
But it is absolutely true that asm is architecture specific.
Is that really *absolutely* true? Architecture usually includes binary
encoding (and memory order model and perhaps other non-assembly details).
I do not know if being able to have an interrupt in the middle of an
assembly instruction is a violation of the assembly contract. (In
theory, a few special cases might be handled such that the assembly
instruction that breaks into more than one machine instruction is
handled similarly to breaking instructions into µops.) There might not
be any practical case where all the sub-instructions of an assembly
instruction are also assembly instructions (especially not if
retaining instruction size compatibility, which would be difficult
with such assembly instruction fission anyway).
The classic case is the VAX MOVC3/MOVC5 instructions. An interrupt
could occur during the move and simply restart the instruction
(the register operands having been updated as each byte was moved).
Paul Clayton <[email protected]> posted:
reasonable to have a fuzzier sense of assembly language to include at
least encoding changes. It seems reasonable to me for "assembly
language" to mean the preferred language for simple mapping to machine
instructions (which can include idioms — different spellings of the
same machine instruction — and macros).
The modern sense of ASM is that it is an ASCII version of binary.
The old sense where ASM was a language that could do anything and
everything (via Macros) has slipped into the past.
In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul Clayton) wrote:
I _feel_ that if only the opcode encoding is changed (a very tiny
difference that would only affect using code as data) that one
could rightly state that the new architecture uses the same
assembly.
That would, however, raise questions and doubts among everyone who was
aware of the different instruction encodings. You would do far better to
say that the new architecture is compatible at the assembler source level, but not at the binary level.
I doubt there could be any economic justification for
only changing the opcode encoding, but theoretically such could
have multiple architectures with the same assembly.
There was a threatened case of this in the early years of this century.
Intel admitted to themselves that AMD64 was trouncing Itanium in the marketplace, and they needed to do 64-bit x86 or see their company shrink dramatically. However, they did not want to do an AMD-compatible x86-64.
They wanted to use a different instruction encoding and have deliberate binary incompatibility.
This was crazy from the network externalities point of view. It was an anti-competitive move, requiring software vendors to do separate builds
for Intel and AMD, hoping that they would not bother with AMD builds.
Microsoft killed this idea, by refusing to support any such
Intel-specific 64-bit x86. They could not prevent Intel doing it, but
there would not be Windows for it. Intel had to climb down.
I do not think assembly language considered the possible effects of
memory order model. (Have all x86 implementations been compatible?
I think the specification changed, but I do not know if
compatibility was broken.)
In general, the assembly programmer is responsible for considering the
memory model, not the language implementation.
In addition to the definition for "assembly language" one also
needs to define "architecture".
Actually, the world seems to get on OK without such clear definitions.
The obscurity of assembly language tends to limit its use to those who
really need to use it, and who are prepared to use a powerful but
unforgiving tool.
Intel has sold incompatible architectures within the same design
by fusing off functionality and has even had different application
cores in the same chip have different instruction support (though
that seems to have bitten Intel).
Well, different ISA support in different cores in the same processor
package is just dumb[1]. It reflects a delusion that Intel has suffered
since at least the late 1990s: that software is specific to particular generations of their chips, and there's a new release with significant changes for each new generation. Plenty of Intel people know that is true
for motherboard firmware, but not for operating systems or application software. But the company carries on behaving that way.
[1] See the Cell processor for an extreme example.
On 1/28/26 10:34 AM, John Dallman wrote:
In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul Clayton) wrote:
Would the Intel-64 have been assembly compatible with AMD64? I
would have guessed that not just encodings would have been
different. If one wants to maintain market friction, supporting
the same assembly seems counterproductive.
This was crazy from the network externalities point of view. It was an anti-competitive move, requiring software vendors to do separate builds
for Intel and AMD, hoping that they would not bother with AMD builds.
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think). However, doing what is best generally
for customers is not necessarily the most profitable action.
Microsoft killed this idea, by refusing to support any such
Intel-specific 64-bit x86. They could not prevent Intel doing it, but
there would not be Windows for it. Intel had to climb down.
Which was actually a sane action not just from the hassle to
Microsoft of supporting yet another ISA but the confusion of
users (Intel64 and AMD64 both run x86-32 binaries but neither
Intel64 nor AMD64 run the other's binaries!) which would impact
Microsoft (and PC OEMs) more than Intel.
I do not think assembly language considered the possible effects of
memory order model. (Have all x86 implementations been compatible?
I think the specification changed, but I do not know if
compatibility was broken.)
In general, the assembly programmer is responsible for considering the memory model, not the language implementation.
Yes, but for a single-threaded application this is not a factor —
so such would be more compatible. It is not clear if assembly
programmers would use less efficient abstractions (like locks) to
handle concurrency in which case a different memory model might
not impact correctness. On the one hand, assembly is generally
chosen because C provides insufficient performance (or
expressiveness), which would imply that assembly programmers
would not want to leave any performance on the table and would
exploit the memory model. On the other hand, the assembly
programmer mindset may often be more serial and the performance
cost of using higher abstractions for concurrency may be lower
than the debugging costs of being clever relative to using
cleverness for other optimizations.
In addition to the definition for "assembly language" one also
needs to define "architecture".
Actually, the world seems to get on OK without such clear definitions.
The obscurity of assembly language tends to limit its use to those who really need to use it, and who are prepared to use a powerful but unforgiving tool.
Yes, the niche effect helps to avoid diversity of meaning across
users and across time. I suspect jargon also changes less rapidly
than common language both because there is less interaction and
there is more pressure to be formal in expression.
Intel has sold incompatible architectures within the same design
by fusing off functionality and has even had different application
cores in the same chip have different instruction support (though
that seems to have bitten Intel).
Well, different ISA support in different cores in the same processor package is just dumb[1]. It reflects a delusion that Intel has suffered since at least the late 1990s: that software is specific to particular generations of their chips, and there's a new release with significant changes for each new generation. Plenty of Intel people know that is true for motherboard firmware, but not for operating systems or application software. But the company carries on behaving that way.
Paul Clayton <[email protected]> posted:[...]
On 1/28/26 10:34 AM, John Dallman wrote:
In article <10lbcg1$3uh8h$[email protected]>, [email protected] (Paul Clayton) wrote:
Would the Intel-64 have been assembly compatible with AMD64? I
Andy Glew indicated similar but not exact enough.
Andy also stated that Microsoft forced Intel's hand towards x86-64.
Currently, assembly-level compatibility does not seem worthwhile.
Software is usually distributed as machine code binaries not as
assembly,
Would the Intel-64 have been assembly compatible with AMD64? I
would have guessed that not just encodings would have been
different. If one wants to maintain market friction, supporting
the same assembly seems counterproductive.
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think).
It is not clear if assembly programmers would use less efficient
abstractions (like locks) to handle concurrency in which case
a different memory model might not impact correctness.
I do not think ISA heterogeneity is necessarily problematic.
I suspect it might require more system-level organization (similar
to Apple).
Even without ISA heterogeneity, optimal scheduling
seems to be a hard problem. Energy/power and delay/performance
preferences are not typically expressed. The abstraction of each
program owning the machine seems to discourage nice behavior (pun
intended).
(I thought Intel marketed their initial 512-bit SIMD processors
as GPGPUs with x86 compatibility, so the idea of having a
general purpose ISA morphed into a GPU-like ISA had some
fascination after Cell.)
Robert Finch <[email protected]> posted:
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a
completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6 bits. Right away, that
reduced most instructions by eight bits.
4 register specifiers: check.
I decided I liked the dual operations that some instructions supported,
which need a wide instruction format.
With 48-bits, if you can get 2 instructions 50% of the time, you are only
12% bigger than a 32-bit ISA.
One gotcha is that 64-bit constant overrides need to be modified. For
Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit
instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift
instruction. It is ugly and takes about three instructions.
It was that sticking problem of constants that drove most of My 66000
ISA style--variable length and how to encode access to these constants
and routing thereof.
Motto: never execute any instructions fetching or building constants.
I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
a) VEC-LOOP loops do not alter the branch prediction tables.
b) Predication clauses do not alter the BPTs.
Paul Clayton <[email protected]> posted:
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think). However, doing what is best generally
for customers is not necessarily the most profitable action.
Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
and making optimal battle plans for the Little Big Horn battle to come.
One can still buy a milling machine built in 1937 and run it in his shop.
Can one even do this for software from the previous decade ??
MS wants you to buy Office every time you buy a new PC.
MS, then moves all the menu items to different pull downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger
pull-down bars.
Is it any wonder users want the 1937 milling machine model ???
On 11/5/25 3:52 PM, MitchAlsup wrote:
Robert Finch <[email protected]> posted:
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a
completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6 bits. Right away, that
reduced most instructions by eight bits.
4 register specifiers: check.
I decided I liked the dual operations that some instructions supported,
which need a wide instruction format.
With 48-bits, if you can get 2 instructions 50% of the time, you are only 12% bigger than a 32-bit ISA.
I must be misunderstanding your math; if half of the
6-byte instructions are two operations, I think that
means 12 bytes would have three operations which is
the same as for a 32-bit ISA.
Perhaps you meant for every two instructions, there
is a 50% chance neither can be "fused" and a 50%
chance they can be fused with each other; this would
get four operations in 18 bytes, which _is_ 12.5%
bigger. That seems an odd expression, as if the
ability to fuse was not quasi-independent.
It could just be that one of us has a "thought-O".
One gotcha is that 64-bit constant overrides need to be modified. For
Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit
instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift
instruction. It is ugly and takes about three instructions.
It was that sticking problem of constants that drove most of My 66000
ISA style--variable length and how to encode access to these constants
and routing thereof.
Motto: never execute any instructions fetching or building constants.
I am guessing that having had experience with x86
(and the benefit of predecode bits), you recognized
that VLE need not be horribly complex to parse.
My 66000 does not use "start bits", but the length
is quickly decoded from the first word and the
critical information is in mostly fixed locations
in the first word.
(One might argue that opcode
can be in two locations depending on if the
instruction uses a 16-bit immediate or not —
assuming I remember that correctly.)
Obviously, something like DOUBLE could provide
extra register operands to a complex instruction,
though there may not be any operation needing
five register inputs. Similarly, opcode refinement
(that does not affect operation routing) could be
placed into an "immediate". I think you do not
expect to need such tricks because reduced
number of instructions is a design principle and
there is lots of opcode space remaining, but I
feel these also allow the ISA to be extended in
unexpected directions.
I think that motto could be generalized to "do
not do at decode time what can be done at
compile time" (building immediates could be
"executed" in decode). There are obvious limits
to that principle; e.g., one would not encode
instructions as control bits, i.e., "predecoded",
in order to avoid decode work. For My 66000
immediates, reducing decode work also decreases
code size.
Discerning when to apply a transformation and if/
where to cache the result seems useful. E.g., a
compiler caches the source code to machine code
transformation inside an executable binary. My
66000's Virtual Vector Method implementations
are expected, from what I understand, to cache
fetch and decode work and simplify operand
routing.
Caching branch prediction information in an
instruction seems to be viewed generally as not
worth much since dynamic predictors are generally
more accurate.
Static prediction by branch
"type" (e.g., forward not-taken) can require no
additional information. (Branch prediction
_directives_ are somewhat different. Such might
be used to reduce the time for a critical path,
but average time is usually a greater concern.)
On 11/5/25 3:43 PM, MitchAlsup wrote:
[snip]
I am now working on predictors for a 6-wide My 66000 machine--which is a bit
different.
a) VEC-LOOP loops do not alter the branch prediction tables.
b) Predication clauses do not alter the BPTs.
Not recording the history of predicates may have a negative
effect on global history predictors. (I do not know if anyone
has studied this, but it has been mentioned — e.g.,
"[predication] has a negative side-effect because the removal
of branches eliminates useful correlation information
necessary for conventional branch predictors" from "Improving
Branch Prediction and Predicated Execution in Out-of-Order
Processors", Eduardo Quiñones et al., 2007.)
Predicate prediction can also be useful when the availability
of the predicate is delayed. Similarly, selective eager
execution might be worthwhile when the predicate is delayed;
the selection is likely to be predictive (resource use might
be a basis for selection but even estimating that might be
predictive).
On 2/5/26 2:02 PM, MitchAlsup wrote:
Paul Clayton <[email protected]> posted:
[snip]
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think). However, doing what is best generally
for customers is not necessarily the most profitable action.
Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
and making optimal battle plans for the Little Big Horn battle to come.
Rather than making battle plans for how to annihilate each
other, perhaps finding a better solution than the ratting each
other out in the prisoner's dilemma.
[snip]
One can still buy a milling machine built in 1937 and run it in his shop. Can one even do this for software from the previous decade ??
Yes, but dependency on (proprietary) servers for some games has
made them (unnecessarily) unplayable.
From what I understand, one can still run WordPerfect under a
DOS emulator on modern x86-64.
With the poor security of much software, even OSes, one might
want to contain any legacy software in a more secured
environment.
Preventing automatic update is perhaps more of a hassle. Some
people have placed software in a virtual machine that has no
networking to avoid software breaking.
MS wants you to buy Office every time you buy a new PC.
I thought MS wanted everyone to use Office365. It is harder to
force people to get a new computer, but a monthly fee will recur automatically.
MS, then moves all the menu items to different pull downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger
pull-down bars.
Ah, but they are just beginning to include advertising. Imagine
every time one uses the mouse (to indicate to the computer that
the user's eyes are focused on a particular place) an
advertisement appears and follows the cursor movement. Even just
having menu entries that are advertisements would be kind of
annoying, but one would be able to get rid of those by leasing
the premium edition (until one needs to lease the platinum
edition, then the "who wants to remain a millionaire" edition).
Is it any wonder users want the 1937 milling machine model ???
Have no fear; soon you may be merely leasing your computer.
Computers need to have the latest spyware so that advertisements
can be appropriately targeted and adblocking must be made
impossible.
Paul Clayton <[email protected]> posted:
On 2/5/26 2:02 PM, MitchAlsup wrote:
I thought MS wanted everyone to use Office365. It is harder to
force people to get a new computer, but a monthly fee will recur
automatically.
When I need a tool--I buy that tool--I never rent that tool.
Name one feature I would want from office365 that was not already
present in office from <say> 1998.
MS, then moves all the menu items to different pull downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger
pull-down bars.
Ah, but they are just beginning to include advertising. Imagine
every time one uses the mouse (to indicate to the computer that
the user's eyes are focused on a particular place) an
advertisement appears and follows the cursor movement. Even just
having menu entries that are advertisements would be kind of
annoying, but one would be able to get rid of those by leasing
the premium edition (until one needs to lease the platinum
edition, then the "who wants to remain a millionaire" edition).
Why would I or anyone want advertising in office ????????
Is it any wonder users want the 1937 milling machine model ???
Have no fear; soon you may be merely leasing your computer.
Computers need to have the latest spyware so that advertisements
can be appropriately targeted and adblocking must be made
impossible.
I am the kind of guy that turns off "telemetry" and places advertisers
in /hosts file.
Why would I or anyone want advertising in office ????????
Name one feature I would want from office365 that was not already
present in office from <say> 1998.
Why would I or anyone want advertising in office ????????
Paul Clayton <[email protected]> posted:
On 2/5/26 2:02 PM, MitchAlsup wrote:
[snip]
Paul Clayton <[email protected]> posted:
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think). However, doing what is best generally
for customers is not necessarily the most profitable action.
Yes, imagine Custer (Intel) and AMD (Sioux) sitting down together
and making optimal battle plans for the Little Big Horn battle to come.
Rather than making battle plans for how to annihilate each
other, perhaps finding a better solution than the ratting each
other out in the prisoner's dilemma.
[snip]
One can still buy a milling machine built in 1937 and run it in his shop. Can one even do this for software from the previous decade ??
Yes, but dependency on (proprietary) servers for some games has
made them (unnecessarily) unplayable.
From what I understand, one can still run WordPerfect under a
DOS emulator on modern x86-64.
With the poor security of much software, even OSes, one might
want to contain any legacy software in a more secured
environment.
Preventing automatic update is perhaps more of a hassle. Some
people have placed software in a virtual machine that has no
networking to avoid software breaking.
MS wants you to buy Office every time you buy a new PC.
I thought MS wanted everyone to use Office365. It is harder to
force people to get a new computer, but a monthly fee will recur
automatically.
When I need a tool--I buy that tool--I never rent that tool.
Name one feature I would want from office365 that was not already
present in office from <say> 1998.
MS, then moves all the menu items to different pull downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger
pull-down bars.
Ah, but they are just beginning to include advertising. Imagine
every time one uses the mouse (to indicate to the computer that
the user's eyes are focused on a particular place) an
advertisement appears and follows the cursor movement. Even just
having menu entries that are advertisements would be kind of
annoying, but one would be able to get rid of those by leasing
the premium edition (until one needs to lease the platinum
edition, then the "who wants to remain a millionaire" edition).
Why would I or anyone want advertising in office ????????
Is it any wonder users want the 1937 milling machine model ???
Have no fear; soon you may be merely leasing your computer.
Computers need to have the latest spyware so that advertisements
can be appropriately targeted and adblocking must be made
impossible.
I am the kind of guy that turns off "telemetry" and places advertisers
in /hosts file.
On 09/02/2026 20:33, MitchAlsup wrote:
Paul Clayton <[email protected]> posted:
From what I understand, one can still run WordPerfect under a
DOS emulator on modern x86-64.
With the poor security of much software, even OSes, one might
want to contain any legacy software in a more secured
environment.
Most old software did not have poor security. It was secure by not
having features that could be abused - and thus no need to worry about
extra layers to protect said features. MS practically invented the
concept of insecure applications like word processors - they put
unnecessary levels of automation and macros, integrated it with email
(especially their already hopelessly insecure programs), and so on. No
real user has any need for "send this document by email" in their word
processor - but spam robots loved it. (MS even managed to figure out a
way to let font files have executable malware in them.) If you go back
to older tools that did the job they were supposed to do, without trying
to do everything else, security is a non-issue for most software.
MS wants you to buy Office every time you buy a new PC.
I thought MS wanted everyone to use Office365. It is harder to
force people to get a new computer, but a monthly fee will recur
automatically.
When I need a tool--I buy that tool--I never rent that tool.
On 11/4/2025 3:44 PM, Terje Mathisen wrote:[snip]
MitchAlsup wrote:
[email protected] (Anton Ertl) posted:
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction was talking about fairly large and convoluted schemes
for the branch predictors.
But, then always at the end of it using 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not
taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
Where, the idea was that the state-machine is updated with the
current state and branch direction, giving the next state and
next predicted branch direction (for this state).
Could model slightly more complex patterns than the 2-bit
saturating counters, but it is sort of a partial mystery why
(for mainstream processors) more complex lookup schemes and 2
bit state, was preferable to a simpler lookup scheme and 5-bit
state.
Well, apart from the relative "dark arts" needed to cram 4-bit
patterns into a 5 bit FSM (is a bit easier if limiting the
patterns to 3 bits).
Then again, had before noted that the LLMs are seemingly also
not really able to figure out how to make a 5 bit FSM to model a
full set of 4 bit patterns.
Then again, I wouldn't expect it to be all that difficult of a
problem for someone that is "actually smart"; so presumably chip
designers could have done similar.
Well, unless maybe the argument is that 5 or 6 bits of storage
would cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative
predictive weakness of 2-bit state) would have cost more than
smaller tables of 6-bit state?...
Say, for example, 2b:
00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
10_0 => 10_0 //strongly not taken, dir=0
10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
11_0 => 01_1 //strongly taken, dir=0
11_1 => 11_1 //strongly taken, dir=1 (goes weak)
Can expand it to 3-bits, for 2-bit patterns
As above, and 4-more alternating states
And slightly different transition logic.
Say (abbreviated):
000 weak, not taken
001 weak, taken
010 strong, not taken
011 strong, taken
100 weak, alternating, not-taken
101 weak, alternating, taken
110 strong, alternating, not-taken
111 strong, alternating, taken
The alternating states just flip-flopping between taken and not
taken.
The weak states can move between any of the 4.
The strong states used if the pattern is reinforced.
Going up to 3 bit patterns is more of the same (add another bit,
doubling the number of states). Seemingly something goes nasty
when getting to 4 bit patterns though (and can't fit both weak
and strong states for longer patterns, so the 4b patterns
effectively only exist as weak states which partly overlap with
the weak states for the 3-bit patterns).
But, yeah, not going to type out state tables for these ones.
Not proven, but I suspect that an arbitrary 5 bit pattern within
a 6 bit state might be impossible. Although there would be
sufficient state-space for the looping 5-bit patterns, there may
not be sufficient state-space to distinguish whether to move
from a mismatched 4-bit pattern to a 3 or 5 bit pattern.
Whereas, at least with 4-bit, any mismatch of the 4-bit pattern
can always decay to a 3-bit pattern, etc. One needs to be able
to express decay both to shorter patterns and to longer
patterns, and I suspect at this point, the pattern breaks down
(but can't easily confirm; it is either this or the pattern
extends indefinitely, I don't know...).
Could almost have this sort of thing as a "brain teaser" puzzle
or something...
Then again, maybe other people would not find any particular
difficulty in these sorts of tasks.
Terje
How can register renaming be implemented on SPARC? As discussed
above, this can be done independently: Have 96 architectural registers
(plus the window pointer), and make 8 of them global registers, and
the rest 24 visible registers plus 4 windows of 16 registers, with the
usual switching. And then rename these 96 architectural registers.
A variant in the opposite direction would be to treat only the 32
visible registers as architectural registers, avoiding large RAT
entries. The save instruction would emit store microinstructions for
the local and in registers, and then the renamer would rename the out registers to the in registers, and would assign 0 to the local and out registers (which would not occupy a physical register at first). This approach makes the most sense with a separate renamer as is now
common. The restore instruction would rename the in registers to the
out registers, and emit load microinstructions for the local and the
in registers.
OoO tends to work fine with storing around calls and loading around
returns in architectures without register windows, because the storing
mostly consumes resources, but is not on the critical path, and
likewise for the loading (the loads tend to be ready earlier than the instructions on the critical path); and store-to-load forwarding
deals with the problem of a return shortly after a call.
Robert Finch <[email protected]> posted:
Semi-unaligned memory tradeoff. If unaligned access is required, the
memory logic just increments the physical address by 64 bytes to fetch
the next cache line. The issue with this is that it does not go back
to the TLB to get the incremented address translated again, meaning no
protection or translation check is made for that address.
You can determine if an access is misaligned "enough" to warrant two
trips down the pipe:
a) crosses cache width
b) crosses page boundary
Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.
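The two cases can be sketched with simple address arithmetic (the line width and page size below are assumed values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumed line width */
#define PAGE_BYTES       4096 /* assumed page size  */

/* True if [addr, addr+size) spans two cache lines (case a). */
static bool crosses_cache_line(uint64_t addr, uint64_t size)
{
    return (addr / CACHE_LINE_BYTES) !=
           ((addr + size - 1) / CACHE_LINE_BYTES);
}

/* True if [addr, addr+size) spans two pages (case b): this always
   needs a second TLB lookup, since the second line may have a
   different translation and different protection. */
static bool crosses_page(uint64_t addr, uint64_t size)
{
    return (addr / PAGE_BYTES) != ((addr + size - 1) / PAGE_BYTES);
}
```

Note that every page crossing is also a line crossing, which is why the two-trip mechanism for case b covers case a as well.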
In article <10m12ue$2t2k5$[email protected]>, [email protected] (Paul Clayton) wrote:
Currently, assembly-level compatibility does not seem worthwhile.
Not now, no. There was one case where it was valuable: the assembler
source translator for 8080 to 8086. That plus the resemblance of early
MS-DOS to CP/M meant that CP/M software written in assembler could be got working on the early IBM PC and compatibles more rapidly than new
software could be developed in high-level languages. That was one of the factors in the runaway success of PC-compatible machines in the early
1980s.
Software is usually distributed as machine code binaries not as
assembly,
Or as source code...
Would the Intel-64 have been assembly compatible with AMD64? I
would have guessed that not just encodings would have been
different. If one wants to maintain market friction, supporting
the same assembly seems counterproductive.
It would hardly have mattered. Very little assembler is written for
64-bit architectures.
Cooperating with AMD to develop a more sane encoding while
supporting low overhead for old binaries would have been better
for customers (I think).
Intel didn't admit to themselves they needed to do 64-bit x86 until AMD64
was thrashing them in the market. Far too late for collaborative design
by then.
It is not clear if assembly programmers would use less efficient
abstractions (like locks) to handle concurrency in which case
a different memory model might not impact correctness.
You are thinking of doing application programming in assembler. That's
pretty much extinct these days. Use of assembler to implement locks or
other concurrency-control mechanisms in an OS or a language run-time
library is far more likely.
I've been doing low-level parts of application development for over 40
years. In 1983-86, I was working in assembler, or needed to have a very
close awareness of the assembler code being generated by a higher level language. In 1987-1990, I needed to be able to call assembler-level OS functions from C code. Since then, the only coding I've done in assembler
has been to generate hardware error conditions for testing error handlers. I've read and debugged lots of compiler-generated assembler to report compiler bugs, but that has become far less common over time.
I do not think ISA heterogeneity is necessarily problematic.
It requires the OS scheduler to be ISA-aware, and to never, /ever/ put a thread onto a core that can't run the relevant ISA. That will inevitably
make the scheduler more complicated and thus increase system overheads.
I suspect it might require more system-level organization (similar
to Apple).
Have you ever tried to optimise multi-threaded performance on a modern
Apple system with a mixture of Performance and Efficiency cores? I have,
and it's a lot harder than Apple give the impression it will be.
Apple make an assumption: that you will use their "Grand Central Dispatch" threading model. That requires multi-threaded code to be structured as a one-direction pipeline of work packets, with buffers between them, and
one thread/core per pipeline stage. That's a sensible model for some
kinds of work, but not all kinds. It also requires compiler extensions
which don't exist on other compilers. So you have to fall back to POSIX threads to get flexibility and portability.
If you're using POSIX threads, the scheduler seems to assign threads to
cores randomly. So your worker threads spend a lot of time on Efficiency cores. Those are in different clusters from the Performance cores, which means that communications between threads (via locks) are very slow.
Using Apple's performance category attributes for threads has no obvious effect on this.
The way to fix this is to find out how many Performance cores there are
in a Performance cluster (which wasn't possible until macOS 12) and use
that many threads. Then you need to reach below the POSIX threading layer
to the underlying BSD thread layer. There, you can set an association
number on your threads, which tells the scheduler to try to run them in
the same cluster. Then you get stable and near-optimal performance. But finding out how to do this is fairly hard, and few seem to have managed it.
Even without ISA heterogeneity, optimal scheduling
seems to be a hard problem. Energy/power and delay/performance
preferences are not typically expressed. The abstraction of each
program owning the machine seems to discourage nice behavior (pun
intended).
Allowing processes to find out the details of other processes' resource
usage makes life very complicated, and introduces new opportunities for security bugs.
(I thought Intel marketed their initial 512-bit SIMD processors
as GPGPUs with x86 compatibility, so the idea of having a
general purpose ISA morphed into a GPU-like ISA had some
fascination after Cell.)
Larrabee turned out to be a pretty bad GPU, and a pretty bad set of CPUs.
On 11/5/25 2:00 AM, BGB wrote:
On 11/4/2025 3:44 PM, Terje Mathisen wrote: [snip]
MitchAlsup wrote:
[email protected] (Anton Ertl) posted:
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction was talking about fairly large and convoluted schemes for
the branch predictors.
You might be interested in looking at the 6th Championship
Branch Prediction (2025):
https://ieeetcca.org/2025/02/18/6th-championship-branch-prediction-cbp2025/
TAgged GEometric length predictors (TAGE) seem to be the current
"hotness" for branch predictors. These record very long global
histories and fold them into shorter indexes with the number of
history bits used varying for different tables.
(Because the correlation is less strong, 3-bit counters are
generally used as well as a useful bit.)
But, then always at the end of it using 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
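That indexing scheme might look roughly like this (the table size, history length, and word-aligned-PC shift are my assumptions, not from the original description):

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BITS  4
#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t local_hist[TABLE_SIZE]; /* per-PC taken/not-taken bits */
static uint8_t state[TABLE_SIZE];      /* FSM state per table entry   */

/* XOR the branch's local history with low-order PC bits to form the
   predictor table index, as described above. */
static uint32_t pred_index(uint32_t pc)
{
    uint32_t slot = (pc >> 2) & (TABLE_SIZE - 1); /* word-aligned PCs */
    uint32_t hist = local_hist[slot] & ((1u << HIST_BITS) - 1);
    return (slot ^ hist) & (TABLE_SIZE - 1);
}

/* Shift the branch outcome into that PC's local history. */
static void update_history(uint32_t pc, int taken)
{
    uint32_t slot = (pc >> 2) & (TABLE_SIZE - 1);
    local_hist[slot] = (uint8_t)((local_hist[slot] << 1) | (taken & 1));
}
```

The FSM state stored at the indexed entry is then consulted for the prediction and updated with the outcome.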
Indexing a predictor by _local_ (i.e., per instruction address)
history adds a level of indirection; once one has the branch
(fetch) address one needs to index the local history and then
use that to index the predictor. The Alpha 21264 had a modest-
sized (by modern standards) local history predictor with a 1024-
entry table of ten history bits indexed by the Program Counter
and the ten bits were used to index a table of three-bit
counters. This was combined with a 12-bit global history
predictor with 2-bit counters (note: not gshare, i.e., xored
with the instruction address) and the same index was used for
the chooser.
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
TAGE has the advantage that the tags reduce branch aliases and
the variable history length (with history folding/compression)
allows using less storage (when a prediction only benefits from
a shorter history) and reduces training time.
(I am a little surprised that I have not read a suggestion to
use the alternate 2-bit encoding that retains the last branch
direction. This history might be useful for global history; the
next most recent direction (i.e., not the predicted direction)
of previous recent branches for a given global history might be
useful in indexing a global history predictor. This 2-bit
encoding seems to give slightly worse predictions than a
saturating counter but the benefit of "localized" global history
might compensate for this.)
Alloyed prediction ("Alloyed Branch History: Combining Global
and Local Branch History for Robust Performance", Zhijian Lu et
al., 2002) used a tiny amount of local history to index a
(mostly) global history predictor, hiding (much of) the latency
of looking up the local history by retrieving multiple entries
from the table and selecting the appropriate one with the local
history.
There was also a proposal ("Branch Transition Rate: A New Metric
for Improved Branch Classification Analysis", Michael Haungs et
al., 2000) to consider transition rate, noting that high
transition rate branches (which flip direction frequently) are
poorly predicted by averaging behavior. (Obviously, loop-like
branches have a high transition rate for one direction.) This is
a limited type of local history. If I understand correctly, your
state machine mechanism would capture the behavior of such
highly alternating branches.
Compressing the history into a pattern does mean losing
information (as does a counter), but I had thought such pattern
storage might be more efficient than storing local history. It
is also interesting that the Alpha 21264 local predictor used
dynamic pattern matching rather than a static transformation of
history to prediction (state machine).
I think longer local history prediction has become unpopular,
probably because nothing like TAGE was proposed to support
longer histories but also because the number of branches that
can be tracked with long histories is smaller.
Local history patterns may also be less common than statistical
correlation after one has extracted branches predicted well by
global history. (For small-bodied loops, a moderately long
global history provides substantial local history.)
Using a pattern/state machine may also make confidence
estimation less accurate. TAGE can use confidence of multiple
matches to form a prediction.
The use of confidence for making a prediction also makes it
impractical to store just a prediction nearby (to reduce
latency) and have the extra state more physically distant. For
per-address predictors, one could in theory use Icache
replacement to constrain predictor size, where an Icache miss
loads local predictions from an L2 local predictor. I think AMD
had limited local predictors associated with the Icache and had
previously stored some prediction information in L2 cache using
the fact that code is not modified (so parity could be used and
not ECC).
Daniel A. Jiménez did some research on using neural methods
(e.g., perceptrons) for branch prediction. The principle was
that traditional global history tables had exponential scaling
with history size (2 to the N table entries for N history bits)
while per-address perceptrons would scale linearly (for a fixed
number of branch addresses). TAGE (with its folding of long
histories and variable history) seems to have removed this as a
distinct benefit. General correlation with specific branches may
also be less predictive than correlation with path.
Nevertheless, the research was interesting and larger histories
did provide more accurate predictions.
Anyway, I agree that thinking about these things can be fun.
Where, the idea was that the state-machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
Could model slightly more complex patterns than the 2-bit saturating
counters, but it is sort of a partial mystery why (for mainstream
processors) more complex lookup schemes with 2-bit state were
preferable to a simpler lookup scheme with 5-bit state.
Well, apart from the relative "dark arts" needed to cram 4-bit
patterns into a 5 bit FSM (is a bit easier if limiting the patterns to
3 bits).
Then again, had before noted that the LLMs are seemingly also not
really able to figure out how to make a 5 bit FSM to model a full set
of 4 bit patterns.
Then again, I wouldn't expect it to be all that difficult of a problem
for someone that is "actually smart"; so presumably chip designers
could have done similar.
Well, unless maybe the argument is that 5 or 6 bits of storage would
cost more than 2 bits, but then presumably needing to have
significantly larger tables (to compensate for the relative predictive
weakness of 2-bit state) would have cost more than the cost of
smaller tables of 6-bit state?...
Say, for example, 2b:
00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
10_0 => 10_0 //strongly not taken, dir=0 (stays strong)
10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
11_0 => 01_1 //strongly taken, dir=0 (goes weak)
11_1 => 11_1 //strongly taken, dir=1 (stays strong)
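Transcribing that table directly into code (the state names follow the listing above; note the predicted direction is simply the low state bit):

```c
#include <assert.h>
#include <stdint.h>

/* Transcription of the 2-bit table above:
   00 = weakly not-taken, 01 = weakly taken,
   10 = strongly not-taken, 11 = strongly taken. */
static const uint8_t next_state[4][2] = {
    /* 00 weak NT   */ { 2, 1 }, /* NT -> strong NT, T -> weak T   */
    /* 01 weak T    */ { 0, 3 }, /* NT -> weak NT,   T -> strong T */
    /* 10 strong NT */ { 2, 0 }, /* NT -> stays,     T -> weak NT  */
    /* 11 strong T  */ { 1, 3 }, /* NT -> weak T,    T -> stays    */
};

static int predict(uint8_t state)             { return state & 1; }
static uint8_t update(uint8_t state, int dir) { return next_state[state][dir]; }
```

The 3-bit alternating-state extension would add four more states and a slightly wider table, along the lines described next.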
Can expand it to 3 bits, for 2-bit patterns:
As above, and 4 more alternating states,
and slightly different transition logic.
Say (abbreviated):
000 weak, not taken
001 weak, taken
010 strong, not taken
011 strong, taken
100 weak, alternating, not-taken
101 weak, alternating, taken
110 strong, alternating, not-taken
111 strong, alternating, taken
The alternating states just flip-flop between taken and not taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.
Going up to 3 bit patterns is more of the same (add another bit,
doubling the number of states). Seemingly something goes nasty when
getting to 4 bit patterns though (and can't fit both weak and strong
states for longer patterns, so the 4b patterns effectively only exist
as weak states which partly overlap with the weak states for the 3-bit
patterns).
But, yeah, not going to type out state tables for these ones.
Not proven, but I suspect that an arbitrary 5 bit pattern within a 6
bit state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be
sufficient state-space to distinguish whether to move from a
mismatched 4-bit pattern to a 3 or 5 bit pattern. Whereas, at least
with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-
bit pattern, etc. One needs to be able to express decay both to
shorter patterns and to longer patterns, and I suspect at this point,
the pattern breaks down (but can't easily confirm; it is either this
or the pattern extends indefinitely, I don't know...).
Could almost have this sort of thing as a "brain teaser" puzzle or
something...
Then again, maybe other people would not find any particular
difficulty in these sorts of tasks.
Terje
On 2/9/26 2:33 PM, MitchAlsup wrote:
Paul Clayton <[email protected]> posted:
On 2/5/26 2:02 PM, MitchAlsup wrote:
[snip]
MS wants you to buy Office every time you buy a new PC.
I thought MS wanted everyone to use Office365. It is harder to
force people to get a new computer, but a monthly fee will recur
automatically.
When I need a tool--I buy that tool--I never rent that tool.
Name one feature I would want from office365 that was not already
present in office from <say> 1998.
I do not know if MS can legally cancel your MS Office license, and I
doubt the few "software pirates" who continue to use an unsupported ("invalid") version would be worth MS' time and effort to prevent such people from using such software.
However, there seems to be a strong trend toward "you shall own nothing."
MS then moves all the menu items to different pull-downs and
makes it difficult to adjust to the new SW--and then it has the
gall to chew up valuable screen space with ever larger
pull-down bars.
Ah, but they are just beginning to include advertising. Imagine
every time one uses the mouse (to indicate to the computer that
the user's eyes are focused on a particular place) an
advertisement appears and follows the cursor movement. Even just
having menu entries that are advertisements would be kind of
annoying, but one would be able to get rid of those by leasing
the premium edition (until one needs to lease the platinum
edition, then the "who wants to remain a millionaire" edition).
Why would I or anyone want advertising in office ????????
Why would anyone want advertising in a Windows Start Menu?
For Microsoft such provides a bit more revenue/profit as businesses seem willing to pay for such advertisements. Have you ever heard "You are not
the consumer; you are the product"?
I think I read that some streaming services have added
advertising to their (formerly) no-advertising subscriptions, so
the suggested lease term inflation is not completely
unthinkable.
Is it any wonder users want the 1937 milling machine model ???
Have no fear; soon you may be merely leasing your computer.
Computers need to have the latest spyware so that advertisements
can be appropriately targeted and adblocking must be made
impossible.
I am the kind of guy that turns off "telemetry" and places advertisers
in /hosts file.
If all new computers are "leased" (where tampering with the
device — or not connecting it to the Internet such that it can
phone home — revokes "ownership" and not merely warranty and one
agrees to a minimum use [to ensure that enough ads are viewed]),
ordinary users (who cannot assemble devices from commodity
parts) would not have a choice. If governments enforce the
rights of corporations to protect their businesses by outlawing
sale of computer components to anyone who would work around the
cartel, owning a computer could become illegal. Governments have
an interest in having all domestic computers be both secure and
to facilitate domestic surveillance, so mandating features that
remove freedom and require an upgrade cycle (which is also good
for the economy☺) has some attraction.
I doubt people like you are a sufficient threat to profits that
such extreme measures will be used, but the world (and
particularly the U.S.) seems to be becoming somewhat dystopian.
This is getting kind of off-topic and is certainly not something I want
to think about.
I remember reading about the 8080-to-8086 assembly translator. I
did not know that CP/M and MS-DOS were similar enough to
facilitate porting, so that note was interesting to me.
Intel presumably thought Itanium would be the only merchant
64-bit ISA that mattered (and this would exclude AMD) and
that the masses could use 32-bit until less expensive Itanium
processors were possible.
I agree that such would add complexity, but there is already
complexity for power saving with same ISA heterogeneity. NUMA-
awareness, cache sharing, and cache warmth also complicate
scheduling, so the question becomes how much extra complexity
does such introduce.
I still feel an attraction to a market-oriented resource
management such that threads could both minimize resource use
(that might be more beneficial to others) and get more than a
fair-share of resources that are important.
In article <10n2u02$270jc$[email protected]>, [email protected] (Paul Clayton) wrote:
I remember reading about the 8080-to-8086 assembly translator. I
did not know that CP/M and MS-DOS were similar enough to
facilitate porting, so that note was interesting to me.
/Early/ MS-DOS. That used CP/M-like File Control Blocks, and didn't have hierarchical directories. It didn't really support hard disks. The
CP/M-style APIs all carried on existing after MS-DOS 2.0 introduced a new
set of APIs that were more suitable for high-level languages, but they weren't much used in new software.
Intel presumably thought Itanium would be the only merchant
64-bit ISA that mattered (and this would exclude AMD) and
that the masses could use 32-bit until less expensive Itanium
processors were possible.
Pretty much. Then the struggle to make Itanium run fast became the overpowering concern, until they gave up and concentrated on x86-64,
claiming that Itanium would be back in a few years.
I don't think many people took that claim seriously. Some years later, an Intel marketing man was quite shocked to hear that, and that the world
had simply been humouring them.
I agree that such would add complexity, but there is already
complexity for power saving with same ISA heterogeneity. NUMA-
awareness, cache sharing, and cache warmth also complicate
scheduling, so the question becomes how much extra complexity
does such introduce.
If the behaviour of Apple's OSes are any guide, complexity is avoided as
far as possible.
I still feel an attraction to a market-oriented resource
management such that threads could both minimize resource use
(that might be more beneficial to others) and get more than a
fair-share of resources that are important.
The difficulty there is that developers will have a very hard time
creating /measurable/ speed-ups that apply across a wide range of
different configurations. Companies will therefore be reluctant to put developer hours into it that could go into features that customers are
asking for.
John
/Early/ MS-DOS. That used CPM-like File Control Blocks, and didn't have
hierarchical directories. ...
My own limited experience with MS-DOS programming mostly showed them
using integer file-handles and a vaguely Unix-like interface for file IO
at the "int 21h" level.
According to BGB <[email protected]>:
/Early/ MS-DOS. That used CPM-like File Control Blocks, and didn't have
hierarchical directories. ...
My own limited experience with MS-DOS programming mostly showed them
using integer file-handles and a vaguely Unix-like interface for file IO
at the "int 21h" level.
Yeah, Mark Zbikowski added them along with the tree-structured file system in DOS 2.0.
He was at Yale when I was, using a Unix 7th edition system I was supporting.
My own limited experience with MS-DOS programming mostly showed
them using integer file-handles and a vaguely Unix-like interface
for file IO at the "int 21h" level.
Which is, ironically, in conflict with the "FILE *" interface used
by C's stdio API.
Well, apart from some vague (unconfirmed) memories of being exposed
to Pascal via the "Mac Programmer's Workbench" thing at one point
and being totally lost (was very confused, used a CLI but the CLI
commands didn't make sense).
In a way, it showed that they screwed up the design pretty hard
that x86-64 ended up being the faster and more efficient option...
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
Well, or something like PowerPC, but then again, IBM still had
difficulty keeping PPC competitive, so dunno. Then again, I think
IBM's PPC issues were more related to trying to keep up in the chip
fab race that was still going strong at the time, rather than an
ISA design issue.
They did. They really did.
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
Those huge register files had a lot to do with the low code density. They
had two much bigger problems, though.
They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was to have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.
It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't. I don't understand why HP and Intel didn't realise this. It took me years, but I
am no CPU designer.
Speculative execution addresses that problem quite effectively. We don't
have a better way, almost thirty years after Itanium design decisions
were taken. They didn't want to do speculative execution, and they chose
an instruction format and register set that made adding it later hard. If
it was ever tried, nothing was released that had it AFAIK.
The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ... in a very bulky instruction set. They didn't have a general way to extract enough ILP. Nobody does, even now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.
There was also an architectural misfeature with floating-point advance
loads that could make them disappear entirely if there was a call
instruction between an advance-load instruction and the corresponding check-load instruction. That cost me a couple of weeks working out and reporting the bug, which was unfixable. The only work-around was to
re-issue all outstanding floating-point advance-load instructions
after each call returned. The effective code density went down further,
and there were lots of extra read instructions issued.
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998. RISC-V has
not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.
Well, or something like PowerPC, but then again, IBM still had
difficulty keeping PPC competitive, so dunno. Then again, I think
IBM's PPC issues were more related to trying to keep up in the chip
fab race that was still going strong at the time, rather than an
ISA design issue.
I think that was fabs, rather than architecture.
While I was providing libraries for PowerPC (strictly, POWER4, POWER5 and POWER6, one after another) it always had rather decent performance for its clockspeed and process.
John
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
Now, 30 years later the compilers are still in the position of having
made LITTLE progress.
I suspect a big part of the problem was tension between Intel and HP
where the only political solution was allowing the architects from both
sides to "dump in" their favorite ideas. A recipe for disaster.
On 2/19/2026 1:59 PM, John Levine wrote:
According to BGB <[email protected]>:
/Early/ MS-DOS. That used CPM-like File Control Blocks, and didn't have hierarchical directories. ...
My own limited experience with MS-DOS programming mostly showed them
using integer file-handles and a vaguely Unix-like interface for file IO
at the "int 21h" level.
Yeah, Mark Zbikowski added them along with the tree-structured file
system in DOS 2.0.
He was at Yale when I was, using a Unix 7th edition system I was
supporting.
Looks it up...
Yeah, my case, I didn't exist yet when the MS-DOS 2.x line came out...
Did exist for the 3.x line though.
I don't remember much from those years though.
Some fragmentary memories implied that (in that era) I had mostly been
watching shows like Care Bears and similar (but, looking at it at a
later age, found it mostly unwatchable). I think also shows like Smurfs
and Ninja Turtles and similar, etc.
Like, at some point, memory breaking down into sort of an amorphous mass
of things from TV shows all just sort of got mashed together. Not much
stable memory of things other than fragments of TV shows and such.
Not sure what the experience is like for most people though.
My memory from before the age of 4 is extremely spotty, just a couple of
situations that made a lasting impact.
My own limited experience with MS-DOS programming mostly showed
them using integer file-handles and a vaguely Unix-like interface
for file IO at the "int 21h" level.
Which is, ironically, in conflict with the "FILE *" interface used
by C's stdio API.
However, it's entirely concordant with Unix's lower-level file
descriptors, as used in the read() and write() calls.
<https://en.wikipedia.org/wiki/File_descriptor>
<https://en.wikipedia.org/wiki/Read_(system_call)>
The FILE* interface is normally implemented on top of the lower-level
calls, with a buffer in the process' address space, managed by the C
run-time library. The file descriptor is normally a member of the FILE structure.
MS-DOS is not a great design, but it isn't crazy either.
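That layering can be shown in a tiny POSIX example: fileno() exposes the integer descriptor inside a FILE (fileno is POSIX, not strict ISO C).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Sketch: a stdio FILE * normally wraps an integer file descriptor;
   fileno() retrieves the descriptor stored inside the FILE. */
static int underlying_fd(FILE *stream)
{
    return fileno(stream);
}
```

On Unix-like systems the standard streams wrap descriptors 0, 1, and 2, which is exactly the kind of handle the MS-DOS "int 21h" file functions imitated.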
Well, apart from some vague (unconfirmed) memories of being exposed
to Pascal via the "Mac Programmer's Workbench" thing at one point
and being totally lost (was very confused, used a CLI but the CLI
commands didn't make sense).
I used it very briefly. It was a very weird CLI, seemingly designed by someone opposed to the basic idea of a CLI.
In a way, it showed that they screwed up the design pretty hard
that x86-64 ended up being the faster and more efficient option...
They did. They really did.
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
Those huge register files had a lot to do with the low code density. They
had two much bigger problems, though.
They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.
It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't. I don't understand why HP and Intel didn't realise this. It took me years, but I
am no CPU designer.
Speculative execution addresses that problem quite effectively. We don't
have a better way, almost thirty years after Itanium design decisions
were taken. They didn't want to do speculative execution, and they close
an instruction format and register set that made adding it later hard. If
it was ever tried, nothing was released that had it AFAIK.
The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ... in a very bulky instruction set. They didn't have a general way to extract enough ILP. Nobody does, even
now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.
There was also an architectural misfeature with floating-point advance
loads that could make them disappear entirely if there was a call
instruction between an advance-load instruction and the corresponding check-load instruction. That cost me a couple of weeks working out and reporting the bug, which was unfixable. The only work-around was to
re-issue all outstanding floating-point advance-load instructions
after each call returned. The effective code density went down further,
and there were lots of extra read instructions issued.
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998. RISC-V has
not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.
Well, or something like PowerPC, but then again, IBM still had
difficulty keeping PPC competitive, so dunno. Then again, I think
IBM's PPC issues were more related to trying to keep up in the chip
fab race that was still going strong at the time, rather than an
ISA design issue.
I think that was fabs, rather than architecture. While I was providing libraries for PowerPC (strictly, POWER4, POWER5 and POWER6, one after another) it always had rather decent performance for its clock speed and process.
John
On 2/19/2026 5:10 PM, John Dallman wrote:
------------------------------------
This can be used to add resistance against stack-stomping via buffer overflows, but is potentially risky with RISC-V:
AUIPC X1, AddrHi
JALR X0, AddrLo(X1)
Can nuke the process, when officially it is allowed (vs forcing the use
of a different register to encode a long branch).
Like, how about one not try to bake in assumptions about 1-cycle ALU and 2-cycle Load being practical?...
Vs, say, 2-cycle ALU ops and 3-cycle Loads; with an ideal of putting 5 instructions between an instruction that generates a result and the instruction that consumes the result as this is more likely to work with in-order superscalar.
But, then one runs into the issue that if a basic operation then
requires a multi-op sequence, the implied latency goes up considerably
(say, could call this "soft latency", or SL).
So, for example, it means that, say:
2-instruction sign extension:
RV working assumption: 2 cycles
Hard latency (2c ALU): 4 cycles
Soft latency: 12 cycles.
For a 3-op sequence, the effective soft-latency goes up to 18, ...
And, in cases where the soft-latency significantly exceeds the total
length of the loop body, it is no longer viable to schedule the loop efficiently.
So, in this case, an indexed-load instruction has an effective 9c SL, whereas SLLI+ADD+LD has a 21 cycle SL.
where, in this case, the goal of something like the WEXifier is to
minimize this soft-latency cost (in cases where a dependency is seen,
any remaining soft-latency is counted as penalty).
But, then again, maybe the concept of this sort of "soft latency" seems
a bit alien.
Granted, not sure how this maps over to OoO, but had noted that even
with modern CPUs, there still seems to be benefit from assuming a sort
of implicit high latency for instructions over assuming a lower latency.
*1: Where people argue that if each vendor can do a CPU with their own custom ISA variants and without needing to license or get approval from
a central authority, that invariably everything would decay into an incoherent mess where there is no binary compatibility between
processors from different vendors (usual implication being that people
are then better off staying within the ARM ecosystem to avoid RV's lawlessness).
BGB <[email protected]> posted:
On 2/19/2026 5:10 PM, John Dallman wrote:
------------------------------------
This can be used to add resistance against stack-stomping via buffer
overflows, but is potentially risky with RISC-V:
AUIPC X1, AddrHi
JALR X0, AddrLo(X1)
Can nuke the process, when officially it is allowed (vs forcing the use
of a different register to encode a long branch).
That should be:
AUIPC x1,hi(offset)
JALR x0,lo(offset)
using:
SETHI x1,AddrHi
JALR x0,AddrLo
would work.
---------------------
Like, how about one not try to bake in assumptions about 1-cycle ALU and
2-cycle Load being practical?...
for the above to work::
ALU is < ½ cycle leaving ¼ cycle output drive and ¼ cycle input mux
SRAM is ½ cycle, AGEN to SRAM decode is ¼ cycle, SRAM output to shifter
is < ¼ cycle, and set-selection is ½ cycle; leaving ¼ cycle for output drive.
Vs, say, 2-cycle ALU ops and 3-cycle Loads; with an ideal of putting 5
instructions between an instruction that generates a result and the
instruction that consumes the result as this is more likely to work with
in-order superscalar.
1-cycle ALU with 3 cycle LD is not very hard at 16-gates per cycle.
2-cycle LD is absolutely impossible with 1-cycle addr-in to data-out
SRAM. So, we generally consider any design with 2-cycle LD to be
frequency limited.
But, then one runs into the issue that if a basic operation then
requires a multi-op sequence, the implied latency goes up considerably
(say, could call this "soft latency", or SL).
So, for example, it means that, say:
2-instruction sign extension:
RV working assumption: 2 cycles
Hard latency (2c ALU): 4 cycles
Soft latency: 12 cycles.
For a 3-op sequence, the effective soft-latency goes up to 18, ...
One of the reasons a 16-gate design works better in practice than
a 12-gate design. And why a 1-cycle ALU, 3-cycle LD runs at higher
frequency.
And, in cases where the soft-latency significantly exceeds the total
length of the loop body, it is no longer viable to schedule the loop
efficiently.
In software, there remains no significant problem running the loop
in HW.
So, in this case, an indexed-load instruction has an effective 9c SL,
whereas SLLI+ADD+LD has a 21 cycle SL.
3-cycle indexed LD with cache hit in many µarchitectures--with scaled indexing. This is one of the driving influences of "raising" the
semantic content of LD/ST instructions to [Rbase+Rindex<<sc+Disp]
where, in this case, the goal of something like the WEXifier is to
minimize this soft-latency cost (in cases where a dependency is seen,
any remaining soft-latency is counted as penalty).
But, then again, maybe the concept of this sort of "soft latency" seems
a bit alien.
Those ISAs without scaled indexing have longer effective latency through cache than those with; those without full-range Disp have similar problems; those without both are effectively adding 3-4 cycles to LD latency.
Which is why the size of the execution windows grew from 60-ish to 300-ish
to double performance--the ISA is adding latency and the size of execution window is the easiest way to absorb such latency.
{{60-ish ~= Athlon; 300-ish ~= M4}}
Granted, not sure how this maps over to OoO, but had noted that even
with modern CPUs, there still seems to be benefit from assuming a sort
of implicit high latency for instructions over assuming a lower latency.
Execution window size is how it maps.
*1: Where people argue that if each vendor can do a CPU with their own
custom ISA variants and without needing to license or get approval from
a central authority, that invariably everything would decay into an
incoherent mess where there is no binary compatibility between
processors from different vendors (usual implication being that people
are then better off staying within the ARM ecosystem to avoid RV's
lawlessness).
RISC-V seems to be "eating" a year (or a bit more) to bring this mess into
a coherent framework.
In a way, it showed that they screwed up the design pretty hard
that x86-64 ended up being the faster and more efficient option...
They did. They really did.
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
They had two much bigger problems, though.
They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast.
Their solution was to have the compiler schedule loads
well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.
It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't.
Speculative execution addresses that problem quite effectively.
We don't
have a better way, almost thirty years after Itanium design decisions
were taken. They didn't want to do speculative execution
and they chose
an instruction format and register set that made adding it later hard.
The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ...
They didn't have a general way to extract enough ILP.
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998.
RISC-V has
not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
I suspect a big part of the problem was tension between Intel and HP
where the only political solution was allowing the architects from both
sides to "dump in" their favorite ideas. A recipe for disaster.
The odd thing is that these were hardware companies betting on "someone
else" solving their problem, yet if compiler people truly had managed to >solve those problems, then other hardware companies could have taken >advantage just as well.
To me the main question is whether they were truly confused and just got lucky (lucky because they still managed to sell their idea enough that
most RISC companies folded),
On 2/20/2026 5:49 PM, MitchAlsup wrote:
----------------------------
There is a non-zero risk though when one disallows uses that are theoretically allowed in the ISA, even if GCC doesn't use them.
Well, and in terms of typical ASM notation, there is this mess:
(Rb) / @Rb / @(Rb) //load/store register
(Rb, Disp) / Disp(Rb) //load/store disp
@(Rb, Disp) / @(Disp, Rb) //load/store disp (but with @)
Then:
(Rb, Ri) //indexed (element sized index)
Ri(Rb) //indexed (byte-scaled index)
(Rb, Ri, Sc) //indexed with scale
Disp(Rb, Ri) //indexed with displacement
Disp(Rb, Ri, Sc) //indexed with displacement and scale
Then:
@Rb+ / (Rb)+ //post-increment
@-Rb / -(Rb) //pre-decrement
@Rb- / (Rb)- //post-decrement
@+Rb / +(Rb) //pre-increment
And, in some variants, all the registers prefixed with '%'.
[email protected] (John Dallman) writes:----------------
[some unattributed source writes:]
Actually, other architectures also added prefetching instructions for
dealing with that problem. All I have read about that was that there
were many disappointments when using these instructions. I don't know
if there were any successes, and how frequent they were compared to
the disappointments. So I don't see that IA-64 was any different from
other architectures in that respect.
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
----------------------------
I have certainly read about interesting results for binary search (in
an array) where the branching version outperforms the branchless
version because the branching version speculatively executes several
followup loads, and even in unpredictable cases the speculation should
be right in 50% of the cases, resulting in an effective doubling of memory-level parallelism (but at a much higher cost in memory
subsystem load). But other than that, I don't see that speculative
execution helps with memory latency.
The instruction format makes no difference. Having so many registers
may have made it harder than otherwise, but SPARC also used many
registers, and we have seen OoO implementations (and discussed them a
few months ago). The issue is that speculative execution and OoO
make all the EPIC features of IA-64 unnecessary, so if they cannot do
a fast in-order implementation of IA-64 (and they could not), they
should just give up and switch to an architecture without these
features, such as AMD64. And Intel did, after a few years of denial.
The remainder of IA-64 was just to keep it alive for the customers who
had bought into it.
For a few cases, they have. But the problem is that these cases are
also vectorizable.
Their major problem was that they did not get enough ILP from
general-purpose code.
And even where they had enough ILP, OoO CPUs with SIMD extensions ate
their lunch.
- anton
Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
That did not solve memory latency, but that's a problem even for OoO
cores.
I suspect a big part of the problem was tension between Intel and HP
where the only political solution was allowing the architects from both
sides to "dump in" their favorite ideas. A recipe for disaster.
The HP side had people like Bob Rau (Cydrome) and Josh Fisher
(Multiflow), and given their premise, the architecture is ok; somewhat
on the complex side, but they wanted to cover all the good ideas from
earlier designs; after all, it was to be the one architecture to rule
them all (especially performancewise). You cannot leave out a feature
that a competitor could then add to outperform IA-64.
To me the main question is whether they were truly confused and just got lucky (lucky because they still managed to sell their idea enough that
most RISC companies folded),
I think most RISC companies had troubles scaling. They were used to
small design teams spinning out simple RISCs in a short time, and did
not have the organization to deal with the much larger projects that
OoO superscalars required.
And while everybody inventing their own architecture may have looked like a good idea when developing an
architecture and its implementations was cheap,
it looked like a bad
deal when development costs started to ramp up in the mid-90s. That's
why HP went to Intel, and other companies (in particular, SGI) took
this as an exit strategy from their own-RISC business.
DEC had increasing delays in their chips, and eventually could not
make enough money with them and had to sell themselves to Compaq (who
also could not sustain the effort and sold themselves to HP (who
canceled Alpha development)). I doubt that IA-64 played a big role in
that game.
Back to IA-64: At the time, when OoO was just starting, the premise of
IA-64 looked plausible. Why wouldn't they see a fast clock rate and
higher ILP from explicit parallelism than conventional architectures
would see from OoO (apparently complex, and initially without anything
like IA-64's ALAT)?
- anton
BGB <[email protected]> posted:
On 2/20/2026 5:49 PM, MitchAlsup wrote:
----------------------------
There is a non-zero risk though when one disallows uses that are
theoretically allowed in the ISA, even if GCC doesn't use them.
This is why one must decode all 32-bits of each instruction--so that
there is no hole in the decoder that would allow the core to do some-
thing not directly specified in ISA. {{And one of the things that make
an industrial quality ISA so hard to fully specify.}}
---------------------
Well, and in terms of typical ASM notation, there is this mess:
(Rb) / @Rb / @(Rb) //load/store register
(Rb, Disp) / Disp(Rb) //load/store disp
@(Rb, Disp) / @(Disp, Rb) //load/store disp (but with @)
Then:
(Rb, Ri) //indexed (element sized index)
Ri(Rb) //indexed (byte-scaled index)
(Rb, Ri, Sc) //indexed with scale
Disp(Rb, Ri) //indexed with displacement
Disp(Rb, Ri, Sc) //indexed with displacement and scale
Then:
@Rb+ / (Rb)+ //post-increment
@-Rb / -(Rb) //pre-decrement
@Rb- / (Rb)- //post-decrement
@+Rb / +(Rb) //pre-increment
And, in some variants, all the registers prefixed with '%'.
Leading to SERIAL DECODE--which is BAD.
-----------------------
On 2/21/2026 2:15 PM, MitchAlsup wrote:
Whether your ISA can be attacked with Spectré and/or Meltdown;
BGB <[email protected]> posted:
On 2/20/2026 5:49 PM, MitchAlsup wrote:
----------------------------
There is a non-zero risk though when one disallows uses that are
theoretically allowed in the ISA, even if GCC doesn't use them.
This is why one must decode all 32-bits of each instruction--so that
there is no hole in the decoder that would allow the core to do some-
thing not directly specified in ISA. {{And one of the things that make
an industrial quality ISA so hard to fully specify.}}
---------------------
Sometimes there is a tension:
What is theoretically allowed in the ISA;
What is the theoretically expected behavior in some abstract model;
What stuff is actually used by compilers;
What features or behaviors does one want;
...
Implementing RISC-V strictly as per an abstract model would both limit efficiency and hinder some use-cases.
Then it comes down to "what do compilers do" and "what behaviors could an ASM programmer stumble onto unintentionally".
Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
But as computer hardware got faster and denser, it became possible to
do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
[email protected] (Anton Ertl) posted:[1) stride-based. 2) pointer-chasing]
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
Array and Matrix scientific codes with datasets bigger than cache.
I have certainly read about interesting results for binary search (in
an array) where the branching version outperforms the branchless
version because the branching version speculatively executes several
followup loads, and even in unpredictable cases the speculation should
be right in 50% of the cases, resulting in an effective doubling of
memory-level parallelism (but at a much higher cost in memory
subsystem load). But other than that, I don't see that speculative
execution helps with memory latency.
At the cost of opening the core up to Spectré-like attacks.
[email protected] (Anton Ertl) posted:
The HP side had people like Bob Rau (Cydrome) and Josh Fisher
(Multiflow), and given their premise, the architecture is ok; somewhat
on the complex side, but they wanted to cover all the good ideas from
earlier designs; after all, it was to be the one architecture to rule
them all (especially performancewise). You cannot leave out a feature
that a competitor could then add to outperform IA-64.
In this time period, performance was doubling every 14 months, so if a feature added x performance it MUST avoid adding more than x/14 months
to the schedule. If IA-64 was 2 years earlier, it would have been competitive--sadly it was not.
MitchAlsup <[email protected]d> writes:
[email protected] (Anton Ertl) posted:[1) stride-based. 2) pointer-chasing]
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come
to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
I have certainly read about interesting results for binary search
(in an array) where the branching version outperforms the
branchless version because the branching version speculatively
executes several followup loads, and even in unpredictable cases
the speculation should be right in 50% of the cases, resulting in
an effective doubling of memory-level parallelism (but at a much
higher cost in memory subsystem load). But other than that, I
don't see that speculative execution helps with memory latency.
At the cost of opening the core up to Spectré-like attacks.
There may be a way to avoid the side channel while still supporting
this scenario. But I think that there are better ways to speed up
such a binary search:
Here software prefetching can really help: prefetch one level ahead (2 prefetches), or two levels ahead (4 prefetches), three levels (8
prefetches), or four levels (16 prefetches), etc., whatever gives the
best performance (which may be hardware-dependent). The result is a
speedup of the binary search by (in the limit) levels+1.
By contrast, the branch prediction "prefetching" provides a factor 1.5
at twice the number of loads when one branch is predicted, 1.75 at 3x
the number of loads when two branches are predicted, etc. up to a
speedup factor of 2 with an infinite number of loads and predicted branches;
that's for completely unpredictable lookups, with some predictability
the branch prediction approach performs better, and with good
predictability it should outdo the software-prefetching approach for
the same number of additional memory accesses.
- anton
Stefan Monnier <[email protected]> writes:
MitchAlsup <[email protected]d> wrote:Actually, the IA-64 people could point to the work on VLIW (in
At the time of conception, there were amny arguments that {sooner orI can't remember seeing such arguments comping from compiler people, tho.
later} compilers COULD figure stuff like this out.
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
The major problem was that the premise was wrong. They assumed that
in-order would give them a clock rate edge, but that was not the case,
right from the start (The 1GHz Itanium II (released July 2002)
competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
XP (released June 2002)). They also assumed that explicit parallelism
would provide at least as much ILP as hardware scheduling of OoO CPUs,
but that was not the case for general-purpose code, and in any case,
they needed a lot of additional ILP to make up for their clock speed disadvantage.
The odd thing is that these were hardware companies betting on "someone
else" solving their problem, yet if compiler people truly had managed to
solve those problems, then other hardware companies could have taken
advantage just as well.
I am sure they had patents on stuff like the advanced load and the
ALAT, so no, other hardware companies would have had a hard time.
According to Anton Ertl <[email protected]>:
Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software pipelining)), which in turn is based on the work on compilers for microcode.
I knew the Multiflow people pretty well when I was at Yale. Trace
scheduling was inspired by the FPS AP-120B, which had wide
instructions issuing multiple operations and was extremely hard to
program efficiently.
Multiflow's compiler worked pretty well and did a good job of static scheduling memory operations when the access patterns weren't too data dependent. It was good enough that Intel and HP both licensed it and
used it in their VLIW projects. It was a good match for the hardware
in the 1980s.
But as computer hardware got faster and denser, it became possible to
do the scheduling on the fly in hardware, so you could get comparable performance with conventional instruction sets in a microprocessor.
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
Of course, compiler people have worked on such problems and solved some cases. But what I wrote above is that "I can't remember seeing
... compiler people" claiming that "{sooner or later} compilers COULD
figure stuff like this out".
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
For 128 predicate registers, this part doesn't make as much sense:
*1: Where people argue that if each vendor can do a CPU with their
own custom ISA variants and without needing to license or get
approval from a central authority, that invariably everything would
decay into an incoherent mess where there is no binary
compatibility between processors from different vendors (usual
implication being that people are then better off staying within
the ARM ecosystem to avoid RV's lawlessness).
Apropos another thread, I can believe that IA-64 was obsolete before it was shipped
for that reason; static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Are there enough applications like that to make VLIWs worth it? Some kinds of DSP?
John Levine <[email protected]> writes:
Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Concerning the scheduling, hardware scheduling looked pretty dumb at
the time (always schedule the oldest ready instruction(s)), ...
Another aspect where hardware is far superior is branch prediction.
According to Anton Ertl <[email protected]>:
John Levine <[email protected]> writes:
Apropos another thread I can believe that IA-64 was obsolete before it was shipped
for that reason, static scheduling will never keep up with dynamic except in applications where the access patterns are predictable.
Concerning the scheduling, hardware scheduling looked pretty dumb at
the time (always schedule the oldest ready instruction(s)), ...
I was thinking of memory scheduling. You have multiple banks of memory
each of which can only do one fetch or store at a time, and the goal
was to keep all of the banks as busy as possible. If you're accessing
an array in predictable order, trace scheduling works well, but if
you're fetching a[b[i]] where b varies at runtime, it doesn't.
Another aspect where hardware is far superior is branch prediction.
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy.
There were certainly ugly details, e.g.,
if there's a trap on a path that turns out not to be taken.
On Sun, 22 Feb 2026 13:17:30 GMT
[email protected] (Anton Ertl) wrote:
MitchAlsup <[email protected]d> writes:
[1) stride-based. 2) pointer-chasing]
[email protected] (Anton Ertl) posted: =20
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come
to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive even without hardware predictors.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
The recent comparisons of branchy vs branchless binary search that we
carried on the RWT forum seem to suggest that on modern CPUs the branchless
variant is faster even when the table does not fit in LLC.
Branchy variant manages to pull ahead only when TLB misses can't be
served from L2$.
At least that's how I interpreted it.
Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
Michael S <[email protected]> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
[email protected] (Anton Ertl) wrote:
MitchAlsup <[email protected]d> writes:
[1) stride-based. 2) pointer-chasing]
[email protected] (Anton Ertl) posted: =20
=20Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays
come to mind, but are they really common, apart from programming
classes?
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why they
have the roofline performance model (e.g., <https://docs.nersc.gov/tools/performance/roofline/>).
Of course for
the memory-bound applications the uarch style is not particularly
important, as long as it allows exploiting the memory-level
parallelism in some way; prefetching (by hardware or software) is a
way that can be performed by any kind of uarch to exploit MLP.
IA-64 implementations have performed relatively well for SPECfp. I
don't know how memory-bound these applications were, though.
For those applications that benefit from caches (e.g., matrix multiplication), memory-level parallelism is less important.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from compute-bound
to memory-bound.
The recent comparisons of branchy vs branchless binary search that we carried on the RWT forum seem to suggest that on modern CPUs the branchless variant is faster even when the table does not fit in LLC.
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of a
binary array search and prefetches accordingly. The cache misses from
page table accesses might confuse the prefetcher, leading to worse performance eventually.
2) (doubtful) The compiler recognizes the algorithm and inserts
software prefetch instructions.
Branchy variant manages to pull ahead only when TLB misses can't be
served from L2$.
At least that's how I interpreted it.
Here is a result on a very modern CPU: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223974
And here is older gear: https://www.realworldtech.com/forum/?threadid=223776&curpostid=223895
I tried to run your code <https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could not
get gcc-12 to produce branchless code from slightly adapted source
code, either. My own attempt of using extended asm did not pass your
sanity checks, so eventually I used the assembly code produced by
clang-19 through godbolt.
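For readers who have not followed the RWT thread: a "branchless" binary search is usually written so that the probe direction is a data dependency (CMOV-friendly) rather than a control dependency. The following is a generic sketch of the two variants under discussion, not Michael S's actual uut2.c:

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless binary search (lower bound): the comparison result selects
   the new base as data, so compilers typically emit CMOV instead of a
   conditional branch for the inner step. */
size_t bsearch_branchless(const int32_t *a, size_t n, int32_t key)
{
    const int32_t *base = a;
    while (n > 1) {
        size_t half = n / 2;
        /* Data-dependent update: nothing here for the branch predictor
           to guess (assuming the compiler emits CMOV). */
        base = (base[half] < key) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - a) + (*base < key);
}

/* Branchy variant: each probe direction is a conditional branch, which
   is essentially unpredictable on random queries. */
size_t bsearch_branchy(const int32_t *a, size_t n, int32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Whether a given compiler actually emits CMOV for the ternary varies (as seen above with clang-14 vs. clang-19), which is why checking the generated assembly matters.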
Here I see that branchy is faster for the
100M array size on Zen4 (i.e., where on the M3 branchless is faster):
[~/binary-search:165562] perf stat branchless 100000000 10000000 10
4041.093082 msec. 0.404109 usec/point

 Performance counter stats for 'branchless 100000000 10000000 10':

       45436.88 msec task-clock                 1.000 CPUs utilized
            276      context-switches           6.074 /sec
              0      cpu-migrations             0.000 /sec
           1307      page-faults               28.765 /sec
   226091936865      cycles                     4.976 GHz
     1171349822      stalled-cycles-frontend    0.52% frontend cycles idle
    34666912723      instructions               0.15  insn per cycle
                                                0.03  stalled cycles per insn
     5829461905      branches                 128.298 M/sec
       45963810      branch-misses              0.79% of all branches

    45.439541729 seconds time elapsed
    45.397207000 seconds user
     0.040008000 seconds sys

[~/binary-search:165563] perf stat branchy 100000000 10000000 10
3051.269998 msec. 0.305127 usec/point

 Performance counter stats for 'branchy 100000000 10000000 10':

       34308.48 msec task-clock                 1.000 CPUs utilized
            229      context-switches           6.675 /sec
              0      cpu-migrations             0.000 /sec
           1307      page-faults               38.096 /sec
   172472673652      cycles                     5.027 GHz
    49176146462      stalled-cycles-frontend   28.51% frontend cycles idle
    33346311766      instructions               0.19  insn per cycle
                                                1.47  stalled cycles per insn
    10322173655      branches                 300.864 M/sec
     1583842955      branch-misses             15.34% of all branches

    34.311432848 seconds time elapsed
    34.264437000 seconds user
     0.044005000 seconds sys
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy. There were certainly ugly details,
e.g., if there's a trap on a path that turns out not to be taken.
[email protected] (Anton Ertl) posted:
IA-64 was to prevent AMD (and others) from clones--that is all. Intel/HP would have had a patent wall 10 feet tall.
It did not fail by being insufficiently protected, it failed because it
did not perform.
-----------------
For a few cases, they have. But the problem is that these cases are
also vectorizable.
Their major problem was that they did not get enough ILP from
general-purpose code.
And even where they had enough ILP, OoO CPUs with SIMD extensions ate
their lunch.
And that should have been the end of the story. ... Sorry Ivan.
On 2/16/2026 3:14 PM, Paul Clayton wrote:
On 11/5/25 2:00 AM, BGB wrote:
On 11/4/2025 3:44 PM, Terje Mathisen wrote:
[snip]
MitchAlsup wrote:
[email protected] (Anton Ertl) posted:
Branch prediction is fun.
When I looked around online before, a lot of stuff about
branch prediction was talking about fairly large and
convoluted schemes for the branch predictors.
You might be interested in looking at the 6th Championship
Branch Prediction (2025): https://ieeetcca.org/2025/02/18/6th-
championship-branch-prediction-cbp2025/
Quick look, didn't see much information about who entered or won...
TAgged GEometric length predictors (TAGE) seem to be the current
"hotness" for branch predictors. These record very long global
histories and fold them into shorter indexes with the number of
history bits used varying for different tables.
(Because the correlation is less strong, 3-bit counters are
generally used as well as a useful bit.)
When I messed with it, increasing the strength of the saturating
counters was not effective.
But, increasing the ability of them to predict more complex
patterns did help.
[snip]
But, then always at the end of it using 2-bit saturating
counters:
weakly taken, weakly not-taken, strongly taken, strongly
not taken.
But, in my fiddling, there was seemingly a simple but
moderately effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
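The scheme described above can be sketched as a minimal gshare-style predictor. This uses plain 2-bit saturating counters rather than BGB's 5/6-bit FSMs, and the table size and history length are illustrative assumptions, not his actual parameters:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal gshare-style predictor sketch: a global taken/not-taken
   history is XORed with low PC bits to index a table of 2-bit
   saturating counters.  Sizes are illustrative only. */
#define HIST_BITS  6
#define TABLE_SIZE (1u << HIST_BITS)

static uint8_t  counters[TABLE_SIZE]; /* 0..1 predict not-taken, 2..3 taken */
static uint32_t ghist;                /* global history, one bit per branch */

static unsigned predictor_index(uint32_t pc)
{
    return (pc ^ ghist) & (TABLE_SIZE - 1);
}

bool predict(uint32_t pc)
{
    return counters[predictor_index(pc)] >= 2;
}

void update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[predictor_index(pc)];
    if (taken  && *c < 3) (*c)++;   /* saturate at strongly taken */
    if (!taken && *c > 0) (*c)--;   /* saturate at strongly not-taken */
    ghist = ((ghist << 1) | taken) & (TABLE_SIZE - 1);
}
```

Because the history participates in the index, a strictly alternating branch trains two distinct counter entries (one per history value), so even the 2-bit counters predict the ...101010... pattern once warmed up.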
Indexing a predictor by _local_ (i.e., per instruction address)
history adds a level of indirection; once one has the branch
(fetch) address one needs to index the local history and then
use that to index the predictor.
OK, seems I wrote it wrong:
I was using a global branch history.
But, either way, the history is XOR'ed with the relevant bits
from PC to generate the index.
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
If the 5/6-bit FSM can fit more patterns than 3x 2-bit
saturating counters, it can be a win.
As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.
With the PC XOR Hist lookup, there were still quite a few
patterns, and not just the contiguous 0s or 1s that a saturating
counter would predict, nor just all 0s, all 1s, or ...101010...
that the 3-bit FSM could deal with.
But, a bigger history could mean fewer patterns and more
contiguous bits in the state.
TAGE has the advantage that the tags reduce branch aliases and
the variable history length (with history folding/compression)
allows using less storage (when a prediction only benefits from
a shorter history) and reduces training time.
In my case, I was using a 6-bit lookup mostly to fit into
LUT6-based LUTRAM.
Going bigger than 6 bits here is a pain point for FPGAs, more so
as BRAMs don't support narrow lookups, so the next size up
would likely be 2048x, but then the BRAM is used inefficiently
(there isn't likely a particularly good way to make use of
512x 18-bits).
Local history patterns may also be less common than statistical
correlation after one has extracted branches predicted well by
global history. (For small-bodied loops, a moderately long
global history provides substantial local history.)
It seems what I wrote originally was inaccurate; I don't store a
history per-target, merely the recent taken/not-taken branches.
But, I no longer remember what I was thinking at the time, or
why I had written local history rather than global history
(unless I meant "local" in terms of recency or something, I
don't know).
A lot of this seems a lot more complex though than what would be
all that practical on a Spartan or Artix class FPGA.
I was mostly using 5/6 bit state machines as they gave better
results than 2-bit saturating counters, and fit nicely within
the constraints of a "history XOR PC" lookup pattern.
On 2/19/26 6:04 PM, BGB wrote:
In a way, it showed that they screwed up the design pretty hard
that x86-64 ended up being the faster and more efficient option...
They did. They really did.
I guess one question is if they had any other particular drawbacks
other than, say:
Their code density was one of the worst around;
128 registers is a little excessive;
128 predicate register bits is a bit WTF;
Those huge register files had a lot to do with the low code density. They
had two much bigger problems, though.
They'd correctly understood that the low speed of affordable dynamic RAM
as compared to CPUs running at hundreds of MHz was the biggest barrier to making code run fast. Their solution was to have the compiler schedule loads well in advance. They assumed, without evidence, that a compiler with
plenty of time to think could schedule loads better than hardware doing
it dynamically. It's an appealing idea, but it's wrong.
It might be possible to do that effectively in a single-core,
single-thread, single-task system that isn't taking many (if any)
interrupts. In a multi-core system, running a complex operating system, several multi-threaded applications, and taking frequent interrupts and context switches, it is _not possible_. There is no knowledge of any of
the interrupts, context switches or other applications at compile time,
so the compiler has no idea what is in cache and what isn't. I don't understand why HP and Intel didn't realise this. It took me years, but I
am no CPU designer.
Speculative execution addresses that problem quite effectively. We don't
have a better way, almost thirty years after Itanium design decisions
were taken. They didn't want to do speculative execution, and they chose
an instruction format and register set that made adding it later hard. If
it was ever tried, nothing was released that had it AFAIK.
The other problem was that they had three (or six, or twelve) in-order pipelines running in parallel. That meant the compilers had to provide
enough ILP to keep those pipelines fed, or they'd just eat cache capacity
and memory bandwidth executing no-ops ... in a very bulky instruction set. They didn't have a general way to extract enough ILP. Nobody does, even
now. They just assumed that with an army of developers they'd find enough heuristics to make it work well enough. They didn't.
There was also an architectural misfeature with floating-point advance
loads that could make them disappear entirely if there was a call
instruction between an advance-load instruction and the corresponding check-load instruction.
That cost me a couple of weeks working out and
reporting the bug, which was unfixable. The only work-around was to
re-issue all outstanding floating-point advance-load instructions
after each call returned. The effective code density went down further,
and there were lots of extra read instructions issued.
I guess it is more of an open question of what would have happened,
say, if Intel had gone for an ISA design more like ARM64 or RISC-V
or something.
ARM64 seems to me to be the product of a lot more experience with speculatively-executing processors than was available in 1998.
RISC-V has
not demonstrated really high performance yet, and it's been around long enough that I'm starting to doubt it ever will.
In article <10ngao4$d5o$[email protected]>, [email protected] (John Levine)
wrote:
I gather speculative execution of both branch paths worked OK if the
branch tree wasn't too bushy. There were certainly ugly details,
e.g., if there's a trap on a path that turns out not to be taken.
Found a good CPU bug like that on an old AMD chip, the K6-II.
It happened with a floating point divide by zero in the x87 registers, guarded by a test for division by zero, with floating-point traps enabled. The divide got speculatively executed, the trap was stored, the test
revealed the divide would be by zero, the CPU tried to clean up, hit its
bug, and just stopped. Power switch time.
This only happened with the reverse divide instruction, which took the operands off the x87 stack in the opposite order from the usual FDIV. It
was rarely used, so the bug didn't become widely known. But Microsoft's compiler used it occasionally.
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
The recent comparisons of branchy vs branchless binary search that we
carried on RWT forum seems to suggest that on modern CPUs branchless
variant is faster even when the table does not fit in LLC.
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of a
binary array search and prefetches accordingly. The cache misses from
page table accesses might confuse the prefetcher, leading to worse
performance eventually.
Coffee Lake certainly has no such prefetcher and nevertheless exhibits
similar behavior.
I tried to run your code
<https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could not
get gcc-12 to produce branchless code from slightly adapted source
code, either. My own attempt of using extended asm did not pass your
sanity checks, so eventually I used the assembly code produced by
clang-19 through godbolt.
I suppose that you have good reason for avoiding installation of clang17
or later on one of your computers.
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Run for every size from 100K to 2G in increments of x sqrt(2).
BTW, I prefer odd number of iterations.
The point at which the majority of look-ups miss L3$ at least once
will hopefully be seen as a change of slope on the log(N) vs duration
graph for the branchless variant.
I would not bother with performance counters. At least for me they
bring more confusion than insight.
According to my understanding, on Zen4 the ratio of main DRAM
latency to L3 latency is much higher than on either Coffee Lake or M3,
both of which have unified LLC instead of split L3$.
So, if on Zen2 branchy starts to win at ~2x L3 size, I will not be
shocked. But I will be somewhat surprised.
I actually have access to a Zen3-based EPYC, where the above mentioned
ratio is supposedly much bigger than on any competent client CPU (Intel's
Lunar/Arrow Lake do not belong to this category), but this server is
currently powered down and it's a bit of a hassle to turn it on.
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
[email protected] (Anton Ertl) wrote:
MitchAlsup <[email protected]d> writes:
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware predictors, so
they are not "otherwise". I am not familiar enough with sparse
scientific codes to comment on whether they are 1), 2), or
"otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why they
have the roofline performance model (e.g.,
<https://docs.nersc.gov/tools/performance/roofline/>).
Sure. But that's not what I would call "dense".
In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
decomposition of a matrix of similar dimensions.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's no
longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from compute-bound
to memory-bound.
It does not slow down the execution. To the contrary, it speeds it up
so much that speed of handling of L2 misses begins to matter.
Pay attention that it's just my speculation that can be wrong.
IIRC, right now binary64-capable AMX is available only on Apple Silicon
(via SME) and maybe on IBM Z. I didn't play with either.
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300 byte) record for each of its many thousands of employees
each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the "Sales"
department. With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
beyond that coming from the bandwidth limits (if any).
Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
but is memory bandwidth limited.
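The Stream kernels can be sketched as the following "triad" loop: the access pattern is perfectly unit-stride (trivial for a stride prefetcher), yet there is so little arithmetic per byte moved that the loop is bandwidth bound on any core:

```c
#include <stddef.h>

/* STREAM-style "triad" kernel: three unit-stride streams, trivially
   prefetchable, but only one multiply-add per 24 bytes moved, hence
   bandwidth bound rather than latency bound. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```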
Michael S <[email protected]> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
The recent comparisons of branchy vs branchless binary search
that we carried on RWT forum seems to suggest that on modern CPUs
branchless variant is faster even when the table does not fit in
LLC.
Only two explanations come to my mind:
1) The M3 has a hardware prefetcher that recognizes the pattern of
a binary array search and prefetches accordingly. The cache
misses from page table accesses might confuse the prefetcher,
leading to worse performance eventually.
Coffee Lake certainly has no such prefetcher and nevertheless
exhibits similar behavior.
I have now looked more closely, and the npoints parameter has a
significant influence. If it is small enough, branchless fits into
caches while branchy does not (see below), which might be one
explanation for the results. Given that I have not seen npoints
specified in any of the postings that mention branchless winning for
veclen sizes exceeding the L3 cache size, it could be this effect.
Or
maybe some other effect. Without proper parameters one cannot
reproduce the experiment to investigate it.
I tried to run your code
<https://www.realworldtech.com/forum/?threadid=223776&curpostid=223955>
on Zen4, but clang-14 converts uut2.c into branchy code. I could
not get gcc-12 to produce branchless code from slightly adapted
source code, either. My own attempt of using extended asm did not
pass your sanity checks, so eventually I used the assembly code
produced by clang-19 through godbolt.
I suppose that you have good reason for avoiding installation of
clang17 or later on one of your computers.
Sure. It costs time.
If you have recommendations what to use for the other parameters, I
can run other sizes as well.
- anton
Run for every size from 100K to 2G in increments of x sqrt(2).
There is no integer n where 2G=100k*sqrt(2)^n. So I used the numbers
shown below. You did not give any indication on npoints, so I
investigated myself, and found that branchless will miss the L3 cache
with npoints=100000, so I used that and used reps=200.
BTW, I prefer odd number of iterations.
You mean odd reps? Why?
Anyway, here are the usec/point numbers:
Zen4 8700G Tiger Lake 1135G7
veclen branchless branchy branchless branchy
100000 0.030945 0.063620 0.038220 0.080542
140000 0.031035 0.068244 0.034315 0.084896
200000 0.038302 0.073819 0.045602 0.089972
280000 0.037056 0.079651 0.042108 0.096271
400000 0.046685 0.081895 0.055457 0.104561
560000 0.043028 0.088356 0.055095 0.113646
800000 0.051180 0.092201 0.074570 0.123403
1120000 0.048806 0.096621 0.088121 0.142758
1600000 0.060206 0.101069 0.131099 0.172171
2240000 0.073547 0.115428 0.167353 0.205602
3200000 0.094561 0.139996 0.208903 0.234939
4500000 0.121049 0.162757 0.244457 0.268286
6400000 0.152417 0.178611 0.292024 0.295204
9000000 0.189134 0.192100 0.320426 0.327127
12800000 0.219408 0.208083 0.372084 0.353530
18000000 0.237684 0.222140 0.418645 0.389785
25000000 0.270798 0.236786 0.462689 0.415937
35000000 0.296994 0.254001 0.526235 0.451467
50000000 0.330582 0.268768 0.599331 0.478667
70000000 0.356788 0.288526 0.622659 0.522092
100000000 0.388326 0.305980 0.698470 0.562841
140000000 0.407774 0.321496 0.737884 0.609814
200000000 0.442434 0.336242 0.848403 0.654830
280000000 0.455125 0.356382 0.902886 0.729970
400000000 0.496894 0.372735 1.120986 0.777920
560000000 0.520664 0.393827 1.173606 0.855461
800000000 0.544343 0.412087 1.759271 0.901011
1100000000 0.584389 0.431854 1.862866 0.965724
1600000000 0.614764 0.455844 2.046371 1.027111
2000000000 0.622513 0.467445 2.149251 1.089775
So branchy surpasses branchless at veclen=12.8M on both machines, for npoints=100k.
Concerning the influence of npoints, I have worked with veclen=20M in
the following.
1) If <npoints> is small, for branchy the branch predictor will
learn the pattern on the first repetition, and predict correctly in
the following repetitions; on Zen4 and Tiger Lake I see the
following percentages of branch mispredictions (for
<npoints>*<rep>=20_000_000, <veclen>=20_000_000):
npoints Zen4 Tiger Lake
250 0.03% 0.51%
500 0.03% 10.40%
1000 0.03% 13.51%
2000 0.07% 13.84%
4000 13.45% 12.61%
8000 15.26% 12.31%
16000 15.56% 12.26%
32000 15.60% 12.23%
80000 15.59% 12.24%
Tiger Lake counts slightly more branches than Zen4 (for the same
binary): 1746M vs. 1703M, but there is also a real lower number of
mispredictions on Tiger Lake for high npoints; at npoints=80000:
214M on Tiger Lake vs. 266M on Zen4. My guess is that in Zen4 the
mispredicts of the deciding branch interfere with the prediction of
the loop branches, and that the anti-interference measures on Tiger
Lake result in the branch predictor being less effective at
npoints=500..2000.
2) If <npoints> is small all the actually accessed array elements
will fit into some cache. Also, even if npoints and veclen are
large enough that they do not all fit, with a smaller <npoints> a
larger part of the accesses will happen to a cache, and a larger
part to a lower-level cache. With the same parameters as above,
branchy sees on a Ryzen 8700G (Zen4 with 16MB L3 cache) the
following numbers of ls_any_fills_from_sys.all_dram_io
(LLC-load-misses and l3_lookup_state.l3_miss result in <not
supported> on this machine), and on a Core i5-1135G7 (Tiger Lake with
8MB L3) the following number of LLC-load-misses:
branchy branchless
8700G 1135G7 8700G 1135G7
npoints fills LLC-load-misses fills LLC-load-misses
250 1_156_206 227_274 1_133_672 39_212
500 1_189_372 19_820_264 1_125_836 47_994
1000 1_170_829 125_727_181 1_130_941 96_516
2000 1_310_015 279_063_572 1_173_501 299_297
4000 73_528_665 452_147_169 1_151_042 5_661_917
8000 195_883_759 501_433_404 1_248_638 58_877_208
16000 301_559_180 511_420_040 2_688_530 101_811_222
32000 389_512_147 511_713_759 29_799_206 116_312_019
80000 402_131_449 512_460_513 91_276_752 118_762_341
The 16MB L3 cache of the 8700G has 262_144 cache lines, the 8MB L3
of the 1135G7 has 131_072 cache lines. How come branchy has so many
cache misses already at relatively low npoints? Each search
performs ~25 architectural memory accesses; in addition, in branchy
we see a number of misspeculated memory accesses for each
architectural access, resulting in the additional memory accesses.
For branchless, the 8700G sees the ramp-up of memory accesses later
than the 1135G7 by more than the factor of 2 that the difference in
cache sizes would suggest. The 1135G7 cache is divided into 4 2MB
slices, and accesses are assigned by physical address (i.e., the L3
cache does not function like an 8MB cache with a higher
associativity), but given the random-access nature of this workload,
I would expect the accesses to be distributed evenly across the
slices, with little bad effects from this cache organization.
Given the memory access numbers of branchy and branchless above, I
expected to see a speedup of branchy in those cases where branchless
has a lot of L3 misses, so I decided to use npoints=100k and rep=200
in the experiments further up. And my expectations turned out to be
right, at least on these two machines.
- anton
Stephen Fuld <[email protected]d> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300 byte) record for each of its many thousands of employees
each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the "Sales"
department. With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
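The pointer-array case above can be sketched as follows. The record layout and prefetch distance are made-up illustrations of the employee example, and `__builtin_prefetch` is the GCC/clang intrinsic:

```c
#include <stddef.h>

/* Hypothetical record layout for the employee-table example; only the
   fixed offsets of the tested and summed fields matter. */
struct record {
    char pad[32];
    int  department;
    int  salary;
    char rest[260];
};

#define PF_DIST 8   /* prefetch distance; a tuning assumption */

/* Scan records through a pointer array: the addresses of upcoming
   records are already known from recs[], so their cache misses can be
   started early and overlapped with the current iteration's work. */
long sum_department(struct record **recs, size_t n, int dept)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&recs[i + PF_DIST]->department, 0, 1);
        if (recs[i]->department == dept)
            total += recs[i]->salary;
    }
    return total;
}
```

A linked list offers no such opportunity: the address of record i+1 is only known after record i has been fetched, which is exactly why that case stays latency bound.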
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
Michael S <[email protected]> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
On Sun, 22 Feb 2026 13:17:30 GMT
[email protected] (Anton Ertl) wrote:
MitchAlsup <[email protected]d> writes:
Array and Matrix scientific codes with datasets bigger than
cache.
The dense cases are covered by stride-based hardware
predictors, so they are not "otherwise". I am not familiar
enough with sparse scientific codes to comment on whether they
are 1), 2), or "otherwise".
BLAS Level 3 is not particularly external/LLC bandwidth intensive
even without hardware predictors.
There are HPC applications that are bandwidth limited; that's why
they have the roofline performance model (e.g.,
<https://docs.nersc.gov/tools/performance/roofline/>).
Sure. But that's not what I would call "dense".
In my vocabulary "dense" starts at matmul(200x200,x200) or at LU
decomposition of a matrix of similar dimensions.
"Dense matrices" vs. "sparse matrices", not in terms of FLOPS/memory
access.
So adding two dense matrices tends to be memory bandwidth bound, but stride-based prefetchers help to avoid getting any extra latency
beyond that coming from the bandwidth limits (if any).
Likewise, John McCalpin's Stream benchmark uses dense vectors IIRC,
but is memory bandwidth limited.
Overwhelming majority of data served from
L2 cache.
That's with classic SIMD. It's possible that with AMX units it's
no longer true.
I very much doubt that. There is little point in adding an
instruction that slows down execution by turning it from
compute-bound to memory-bound.
It does not slow down the execution. To the contrary, it speeds it up
so much that speed of handling of L2 misses begins to matter.
Pay attention that it's just my speculation that can be wrong.
IIRC, right now binary64-capable AMX is available only on Apple
Silicon (via SME) and may be on IBM Z. I didn't play with either.
AMX is an Intel extension of AMD64 (and IA-32); ARM's SME also has
"matrix" in its name, but is not AMX.
Looking at
<https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions>, it seems
to me that the point of AMX is to deal with small matrices (16x64
times 64x16 for Int8, 16x32 times 32x16 for 16-bit types) of small
elements (INT8, BF16, FP16 and complex FP16 numbers) in a special
unit. Apparently the AMX unit in Granite Rapids consumes 2048 bytes
in 16 cycles, i.e., 128 bytes per cycle and produces 256 or 512 bytes
in these 16 cycles. If each of these matrix multiplications happens
on its own, the result will certainly be bandwidth-bound to L2
and maybe already to L1. If, OTOH, these operations are part of a
larger matrix multiplication, then cache blocking can probably lower
the bandwidth to L2 enough, and reusing one of the operands in
registers can lower the bandwidth to L1 enough.
In any case, Intel will certainly not add hardware that exceeds the
bandwidth boundaries in all common use cases.
- anton
On Tue, 24 Feb 2026 09:50:43 GMT[...]
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
I think that it was said more than once throughout the thread that all
measurements were taken with npoints=1M and rep=11.
And for small rep
even-size median has non-negligible bias.
Intuitively, I don't like measurements with so small an npoints parameter,
but objectively they are unlikely to be much different from 1M points.
On Tiger Lake ratio is smaller than on others, probably because of
Intel's slow "mobile" memory controller
So far I don't see how your measurements disprove my theory of page
table not fitting in L2 cache.
If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
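One way software prefetching could be added to the branchless search: after each probe, only two array slots can be probed next, and both can be prefetched while the current comparison resolves. This is my own sketch (not the experiment Anton proposes above); `__builtin_prefetch` is the GCC/clang intrinsic:

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless binary search (lower bound) with software prefetch of the
   two possible next probe addresses. */
size_t bsearch_branchless_pf(const int32_t *a, size_t n, int32_t key)
{
    const int32_t *base = a;
    while (n > 1) {
        size_t half = n / 2;
        size_t rest = n - half;
        /* Next iteration probes either base[rest/2] (base unchanged)
           or base[half + rest/2] (base advanced); start both loads
           now so one useful miss overlaps the current comparison. */
        __builtin_prefetch(&base[rest / 2], 0, 0);
        __builtin_prefetch(&base[half + rest / 2], 0, 0);
        base = (base[half] < key) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - a) + (*base < key);
}
```

Half of the prefetched lines are wasted bandwidth, which is the usual trade-off of this technique: it mimics the misspeculated loads that make branchy fast at large sizes, but without the mispredict penalty.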
BGB <[email protected]> posted:
On 2/21/2026 2:15 PM, MitchAlsup wrote:
Whether your ISA can be attacked with Spectre and/or Meltdown;
BGB <[email protected]> posted:
On 2/20/2026 5:49 PM, MitchAlsup wrote:
----------------------------
There is a non-zero risk though when one disallows uses that are
theoretically allowed in the ISA, even if GCC doesn't use them.
This is why one must decode all 32-bits of each instruction--so that
there is no hole in the decoder that would allow the core to do some-
thing not directly specified in ISA. {And one of the things that make
an industrial quality ISA so hard to fully specify.}}
---------------------
Sometimes there is a tension:
What is theoretically allowed in the ISA;
What is the theoretically expected behavior in some abstract model;
What stuff is actually used by compilers;
What features or behaviors does one want;
...
Whether your DFAM can be attacked with RowHammer;
Whether your call/return interface can be attacked with:
{ Return-Oriented Programming, Buffer Overflows, ...}
That is, whether you care if your system provides a decently robust programming environment.
I happen to care. Apparently, most do not.
Implementing RISC-V strictly as per an abstract model would limit both
efficiency and hinder some use-cases.
One can make an argument that it is GOOD to limit attack vectors, and
provide a system that is robust in the face of attacks.
Then it comes down to "what do compilers do" and "what behaviors
could an ASM programmer stumble onto unintentionally".
Naïve at best.
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency
bound.
On Tue, 24 Feb 2026 10:46:32 GMT
[email protected] (Anton Ertl) wrote:
The question is not whether bandwidth from L2 to AMX is sufficient. It
is.
The interesting part is what bandwidth from LLC *into* L2 is needed in
a scenario of multiplication of big matrices.
Supposedly 1/3rd of L2 is used for matrix A that stays there for a very
long time, another 1/3rd holds matrix C that remains there for, say, 8
iterations, and the remaining 1/3rd is matrix B that should be reloaded on
every iteration. AMX increases the frequency at which we have to reload
B and C.
Assuming L2 = 2MB and square matrices, N = sqrt(2MB/8B/3) = 295,
rounded down to 288. Each step takes 288**3 = 24M FMA operations. If a
future AMX does 128 FMA per clock then the iteration takes 188K clocks.
288**2 * 8B / 188K = 3.5 B/clock. That's easy for 1 AMX/L2 combo, but
very hard when you have 64 units like those competing on the same
LLC.
Michael S <[email protected]> writes:
On Tue, 24 Feb 2026 09:50:43 GMT[...]
[email protected] (Anton Ertl) wrote:
Michael S <[email protected]> writes:
On Mon, 23 Feb 2026 08:06:20 GMT
[email protected] (Anton Ertl) wrote:
I think that it was said more than once throughout the thread that
all measurements were taken with npoints=1M and rep=11.
I obviously missed it, that's why I asked for the parameters earlier.
And for small rep
even-size median has non-negligible bias.
Even-size median is the arithmetic mean between the n/2-lowest and
the (n/2+1)-lowest result. What bias do you have in mind?
Here are the results with npoints=1M reps=11:
Zen4 8700G Tiger Lake 1135G7
veclen branchless branchy branchless branchy
100000 0.030187 0.063023 0.038614 0.081332
140000 0.029779 0.066112 0.034797 0.085259
200000 0.037532 0.072898 0.045945 0.090583
280000 0.036677 0.078795 0.042678 0.097386
400000 0.045777 0.084639 0.055256 0.104249
560000 0.043092 0.086424 0.055677 0.112252
800000 0.050325 0.091714 0.078172 0.123927
1120000 0.048855 0.095756 0.091508 0.141819
1600000 0.061101 0.099660 0.133560 0.168403
2240000 0.082970 0.113055 0.166363 0.205129
3200000 0.120834 0.137753 0.211253 0.235155
4500000 0.149028 0.160359 0.241716 0.268885
6400000 0.178675 0.177070 0.294336 0.294931
9000000 0.219919 0.195292 0.322921 0.326113
12800000 0.240604 0.209397 0.380076 0.352707
18000000 0.257400 0.224952 0.409813 0.388834
25000000 0.293160 0.239217 0.472999 0.413559
35000000 0.307939 0.257587 0.522223 0.450754
50000000 0.346399 0.271420 0.595265 0.478713
70000000 0.375525 0.286462 0.626350 0.516533
100000000 0.401462 0.305444 0.699097 0.555928
140000000 0.419590 0.322480 0.732551 0.611815
200000000 0.451003 0.337880 0.854034 0.647795
280000000 0.468193 0.359133 0.914159 0.729234
400000000 0.506300 0.372219 1.150605 0.773266
560000000 0.520346 0.390277 1.167567 0.851592
800000000 0.556292 0.412035 1.799293 0.899768
1100000000 0.572654 0.432598 1.861136 0.963649
1600000000 0.618735 0.450526 2.062255 1.026560
2000000000 0.629088 0.465219 2.132035 1.073728
Intuitively, I don't like measurements with so small an npoints
parameter, but objectively they are unlikely to be much different
from 1M points.
It's pretty similar, but the first point where branchy is faster is veclen=6.4M for the 8700G and still 12.8M for the 1135G7; on the
1135G7 the 6.4M and 9M veclens are pretty close for both npoints.
My current theory why the crossover is so late is twofold:
1) branchy performs a lot of misspeculated accesses, which reduce the
cache hit rate (and increase the cache miss rate) already at
relatively low npoints (as shown in <[email protected]>), and likely also for
low veclen with high npoints. E.g., already npoints=4000 has >2M
fills from memory on the 8700G, and npoints=500 has >2M
LLC-load-misses on the 1135G7, whereas the corresponding numbers for
branchless are npoints=16000 and npoints=4000. This means that
branchy does not just suffer from the misprediction penalty, but also
from fewer cache hits for the middle stages of the search. How strong
this effect is depends on the memory subsystem of the CPU core.
2) In the last levels of the larger searches the better memory-level parallelism of branchy leads to branchy catching up and eventually
overtaking branchless, but it has to first compensate for the slowness
in the earlier levels before it can reach crossover. That's why we
see the crossover point at significantly larger sizes than L3.
On Tiger Lake the ratio is smaller than on the others, probably because
of Intel's slow "mobile" memory controller.
This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
separate DIMM, so the memory controller may be faultless.
So far I don't see how your measurements disprove my theory of page
table not fitting in L2 cache.
I did not try to disprove that; if I did, I would try to use huge
pages (the 1G ones if possible) and see how that changes the results.
But somehow I fail to see why the page table walks should make much of
a difference.
If I find the time, I would like to see how branchless with software prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
On Tue, 24 Feb 2026 17:30:31 GMT
[email protected] (Anton Ertl) wrote:
This particular machine has 8GB DDR4 soldered in plus 32GB DDR4 on a
separate DIMM, so the memory controller may be faultless.
Do you mean that dealing with two different types of memory is an
objectively hard job?
Or that the latency of the 1135G7 is not too bad?
It all looks like memory latency on this TGL is ~90ns.
If I find the time, I would like to see how branchless with software
prefetching performs. And I would like to put all of that, including
your code, online. Do I have your permission to do so?
- anton
Doesn't SW prefetch turn itself into a NOP on a TLB miss?
On 2/18/26 3:45 PM, BGB wrote:
I do not know if 5/6-bit state machines have been academically
examined for predictor entries. I suspect the extra storage is a
significant discouragement given one often wants to cover more
different correlations and branches.
If the 5/6-bit FSM can fit more patterns than 3x 2-bit saturating
counters, it can be a win.
I suspect it very much depends on whether bias or pattern is
dominant. This would depend on the workload (Doom?) and the
table size (and history length). I do not know that anyone in
academia has explored this, so I think you should be proud of
your discovery even if it has limited application.
A larger table (longer history) can mean longer training, but
such also discovers more patterns and longer patterns (e.g.,
predicting a fixed loop count). However, correlation strength
tends to decrease with increasing distance (having multiple
history lengths and hashings helps to find the right history).
As noted, the 5/6 bit FSM can predict arbitrary 4 bit patterns.
When the pattern is exactly repeated this is great, but if the
correlation with global history is fuzzy (but biased) a counter
might be better.
I get the impression that branch prediction is complicated
enough that even experts only have a gist of what is actually
happening, i.e., there is a lot of craft and experimentation and
less logical derivation (I sense).
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
Educators who are any good notice the weird kids who are actually smart.
For 128 predicate registers, this part doesn't make as much sense:
I suspect they wanted to re-use some logic.
The tricks Itanium could do with combinations of predicate registers were pretty weird. There was at least one instruction for manipulating them
which I was entirely unable to understand, with the manual in front of me
and pencil and paper to try examples. Fortunately, it never occurred in
code generated by any of the compilers I used.
*1: Where people argue that if each vendor can do a CPU with their
own custom ISA variants and without needing to license or get
approval from a central authority, that invariably everything would
decay into an incoherent mess where there is no binary
compatibility between processors from different vendors (usual
implication being that people are then better off staying within
the ARM ecosystem to avoid RV's lawlessness).
The importance of binary compatibility is very much dependent on the
market sector you're addressing. It's absolutely vital for consumer apps
and games. It's much less important for current "AI" where each vendor
has their own software stack anyway. RISC-V seems to be far more
interested in the latter at present.
Stephen Fuld <[email protected]d> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non-scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300-byte) record for each of its many thousands of employees
(each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the 'Sales'
department?" There is no index on "department", but it is at a fixed
displacement within each record, so the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level parallelism of the hardware. If the records are in RAM, the hardware prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance. To me, that is the definition
of being latency bound, not bandwidth bound.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
On 2/24/2026 5:25 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can help.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file (table),
with a (say 300 byte) record for each of its many thousands of employees
each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the "Sales"
department. With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips.
Seemingly about the only real hope the CPU has is to have a large cache and just
hope that the data happens to be in the cache (and has been accessed previously or sufficiently recently) else it is just kinda SOL.
If there is some way that CPU's can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this personally.
Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
PNG decoding typically processes the image as several major phases:
Decompress the Deflate-compressed buffer into memory;
Walk over the image, running scanline filters,
copying scanlines into a new (output) buffer.
On 2/22/2026 3:52 PM, John Dallman wrote:
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
Educators who are any good notice the weird kids who are actually smart.
Sometimes I question if I really am though.
Like, some evidence says I am, but by most metrics of "life success" I
have done rather poorly.
And, in middle and high-school, they just sorta forced me to sit through normal classes (which sucked really hard).
Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned more if they just let me spend the time doing whatever else).
The tricks Itanium could do with combinations of predicate registers were pretty weird. There was at least one instruction for manipulating them which I was entirely unable to understand, with the manual in front of me and pencil and paper to try examples. Fortunately, it never occurred in code generated by any of the compilers I used.
Possibly.
I had also looked into a more limited set of predicate registers at one point, but this fizzled in favor of just using GPRs.
So, as noted:
I have 1 predicate bit (T bit);
Had looked into expanding it to 2 predicate bits (using an S bit as a
second predicate), but this went nowhere.
With matrices of just the right size, one can achieve a TLB miss on
every 8th access.
And, in middle and high-school, they just sorta forced me to sit through
normal classes (which sucked really hard)
In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while
vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board
and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral, don't be bored in class, do something useful instead.
Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned
more if they just let me spend the time doing whatever else).
For most people, school attempts to give the students just enough knowledge
that they are not burdens on society.
MitchAlsup <[email protected]d> writes:
And, in middle and high-school, they just sorta forced me to sit through
normal classes (which sucked really hard)
In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while
vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board
and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral, don't be bored in class, do something useful instead.
Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned
more if they just let me spend the time doing whatever else).
For most people, school attempts to give the students just enough knowledge
that they are not burdens on society.
My high school (1970s, when the split was K-7, 7-9, 10-12) had
four "communities".
Traditional
Career
Work Study
Flexible Individual Learning (FIL)
The college-bound were generally part of the
FIL community. Career included business classes,
traditional was more like the olden days and
Work Study included off-school apprenticeships,
shop classes, electronics training, etc.
Students mostly took classes with peers in their
community (there were over 400 in my graduating class).
Worked rather well, but ended up segregating students
by income level as well as IQ, so
the school district changed that in the
80s in the interest of equality treating the
entire high school as a single community. The
quality of the education received diminished
thereafter, IMO.
BGB <[email protected]> posted:
On 2/22/2026 3:52 PM, John Dallman wrote:
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
Educators who are any good notice the weird kids who are actually smart.
Sometimes I question if I really am though.
Like, some evidence says I am, but by most metrics of "life success" I
have done rather poorly.
And, in middle and high-school, they just sorta forced me to sit through
normal classes (which sucked really hard)
In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral, don't be bored in class, do something useful instead.
Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned
more if they just let me spend the time doing whatever else).
For most people, school attempts to give the students just enough knowledge that they are not burdens on society.
-------------------------
The tricks Itanium could do with combinations of predicate registers were
pretty weird. There was at least one instruction for manipulating them
which I was entirely unable to understand, with the manual in front of me
and pencil and paper to try examples. Fortunately, it never occurred in
code generated by any of the compilers I used.
It could have been a case where the obvious logic decoding "that" field in the instruction allowed for "a certain pattern" to perform what they described
in the spec. I did some of this in Mc 88100, and this is what taught me never to do it again or allow anyone else to do it again.
Possibly.
I had also looked into a more limited set of predicate registers at one
point, but this fizzled in favor of just using GPRs.
So, as noted:
I have 1 predicate bit (T bit);
Had looked into expanding it to 2 predicate bits (using an S bit as a
second predicate), but this went nowhere.
I have tried several organizations over the last 40 years of practice:
In my Humble and Honest Opinion, the only constructs predicates should
support are singular comparisons and comparisons using && and || with
deMorganizing logic {~}--not because other forms are unuseful, but
because those are the constructs programmers use when writing code.
On 2/22/2026 3:52 PM, John Dallman wrote:
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
Educators who are any good notice the weird kids who are actually smart.
Sometimes I question if I really am though.
Like, some evidence says I am, but by most metrics of "life success" I
have done rather poorly.
And, in middle and high-school, they just sorta forced me to sit through normal classes (which sucked really hard). Well, and I apparently missed
the point of school, thinking it was more of an endurance thing with
sort of a vague pretense of education (and I probably would have learned more if they just let me spend the time doing whatever else).
...
But, it seems like a case of:
By implication, I am smart, because if I wasn't, even my own (sometimes pointless) hobby interests would have been out of reach.
Like, not a world of difficulty justifying them, or debating whether or
not something is worth doing, but likely not something someone could do
at all.
Or, maybe, like encountering things that seem confusing isn't such a
rare experience (or that people have learned how to deal more
productively with things they can see but don't understand?...).
But, there is a thing I have noted:
I had a few times mentioned to people that certain AIs had gotten smart
enough to start understanding how a 5/6-bit finite state machine to
predict repeating 1-4 bit patterns would be constructed.
Then, I try to describe it, and then realize that for the people I try
to mention it to, it isn't that they have difficulty imagining how one
would go about filling in the table and getting all of the 4 bit
patterns to fit into 32 possible states. Many seem to have difficulty understanding how such a finite state machine would operate in the first place.
Even though this part seems like something that pretty much anyone should be able to understand.
Initially, I had used this as a test case for the AIs because it posed "moderate difficulty" for problems which could be reasonably completely described in a chat prompt (and is not overly generic).
Nevermind if it is still a pain to generate tables by hand, and my
attempts at hand-generated tables have tended to have worse adaptation
rates than those generated using genetic algorithms (can be cleaner
looking, but tend to need more input bits to reach the target state if
the pattern changes).
Sometimes I feel like a poser.
Other things, it seems, I had taken for granted.
Seems sometimes that if I were "actually smart", I would have figured out
some way to make better and more efficient use of my span of existence.
On 2/24/2026 5:25 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to
mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional
architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file (table), >>> with a (say 300 byte) record for each of its many thousands of employees >>> each containing typical employee stuff). Now suppose someone wants to
know "What is the total salary of all the employees in the "Sales"
department. With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a
trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming
language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips. Seemingly about the only real hope the CPU has is to have a large cache and just hope that the data happens to be in the cache (and has been accessed
previously or sufficiently recently) else it is just kinda SOL.
If there is some way that CPU's can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this personally.
Rather, as can be noted, memory access patterns can often make a fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
Could you have a secondary thread that started as soon as one (or a
PNG decoding typically processes the image as several major phases:
Decompress the Deflate-compressed buffer into memory;
Walk over the image, running scanline filters,
copying scanlines into a new (output) buffer.
Even if the parts, taken in isolation, should be fast:
This reminds me of CABAC decoding in h264, where the output of the
The image buffers are frequently too large to fit in cache;
Cache misses tend to make PNG decoding painfully slow,
even when using faster filters.
If using the Paeth filter though, this adds extra slowness,
due to branch-predictor misses.
On targets like x86,
the filter is frequently implemented using branches;
The branch miss rate is very high.
So, a naive branching version, performs like dog crap.
So, net result: Despite its conceptual simplicity, PNG's decode-time performance typically sucks.
Contrast, a decoder for a JPEG like format can be made to process one
block at a time and go all the way to final output. So, JPEG is often
faster despite the more complex process (with transform stages and a colorspace transform).
The Paeth filter slowness does seem a little odd though:
I would look for a way to handle multiple pixels at once, with SIMD
Theoretically, a CPU could turn a short forward branch into predication;
But, this doesn't tend to be the case.
It then is faster to turn the filter into some convoluted mess of
arithmetic and masking in an attempt to reduce the branch mispredict costs.
BGB <[email protected]> posted:
On 2/22/2026 3:52 PM, John Dallman wrote:
In article <10nak0a$nrac$[email protected]>, [email protected] (BGB) wrote:
Does imply that my younger self was notable, and not seen as just
some otherwise worthless nerd.
Educators who are any good notice the weird kids who are actually smart.
Sometimes I question if I really am though.
Like, some evidence says I am, but by most metrics of "life success" I
have done rather poorly.
And, in middle and high-school, they just sorta forced me to sit through
normal classes (which sucked really hard)
In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral, don't be bored in class, do something useful instead.
MitchAlsup wrote:
In my case, I remember sitting in the back of advanced algebra class
(mostly senior HS people, me a sophomore) doing chemistry homework while
vaguely listening to the teacher fail to get various students to solve
a typical algebra problem. Then she called on me, I looked up at the board and in less than a second I rattled off the answer skipping 5 steps along
the way. Moral, don't be bored in class, do something useful instead.
I used a double physics time slot (i.e two 50-min time slots with a
5-min break between them) in exactly the same way, except that I
calculated ~24 digits of pi using the Taylor series for atan(1/5) and atan(1/239). The latter part was much faster of course!
Doing long divisions by 25 and (n^2+n) took the majority of the time.
Terje
PS. I re-implemented the exact same algorithm, using base 1e10, on the
very first computer I got access to, a Univac 110x in University. This
was my first ever personal piece of programming.
Stephen Fuld <[email protected]d> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to increase the MLP.
Concerning a reduction in latency, that would not
increase performance, because this application is already working at
the bandwidth limit.
I feel the urge to write up a mock variant of your use case and
measure whether reality confirms my expectations, but currently have
no time for that.
But let's take a slightly simpler case for which I already have a
program:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited. Think of it this way. In the current situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance. To me, that is the definition
of being latency bound, not bandwidth bound.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
BGB wrote:
On 2/24/2026 5:25 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/21/2026 8:18 AM, Anton Ertl wrote:
big snip
Otherwise what kind of common code do we have that is
memory-dominated? Tree searching and binary search in arrays come to mind, but are they really common, apart from programming classes?
It is probably useful to distinguish between latency bound and
bandwidth
bound.
If a problem is bandwidth-bound, then differences between conventional
architectures and EPIC play no role, and microarchitectural
differences in the core play no role, either; they all have to wait
for memory.
For latency various forms of prefetching (by hardware or software) can
help.
Many occur in commercial (i.e. non scientific) programs, such as
database systems. For example, imagine a company employee file
(table),
with a (say 300 byte) record for each of its many thousands of
employees
each containing typical employee stuff). Now suppose someone wants to know "What is the total salary of all the employees in the "Sales"
department. With no index on "department", but it is at a fixed
displacement within each record, the code looks at each record, does a trivial test on it, perhaps adds to a register, then goes to the next
record. This is almost certainly memory latency bound.
If the records are stored sequentially, either because the programming
language supports that arrangement and the programmer made use of
that, or because the allocation happened in a way that resulted in
such an arrangement, stride-based prefetching will prefetch the
accessed fields and reduce the latency to the one due to bandwidth
limits.
If the records are stored randomly, but are pointed to by an array,
one can prefetch the relevant fields easily, again turning the problem
into a latency-bound problem. If, OTOH, the records are stored
randomly and are in a linked list, this problem is a case of
pointer-chasing and is indeed latency-bound.
BTW, thousands of employee records, each with 300 bytes, fit in the L2
or L3 cache of modern processors.
FWIW:
IME, code with fairly random access patterns to memory, and lots of
cache misses, is inherently slow; even on big/fancy OoO chips.
Seemingly about the only real hope the CPU has is to have a large
cache and just hope that the data happens to be in the cache (and has
been accessed previously or sufficiently recently) else it is just
kinda SOL.
If there is some way that CPU's can guess what memory they need in
advance and fetch it beforehand, I have not seen much evidence of this
personally.
Rather, as can be noted, memory access patterns can often make a
fairly large impact on the performance of some algorithms.
Like, for example, decoding a PNG like format vs a JPG like format:
PNG decoding typically processes the image as several major phases:
Decompress the Deflate-compressed buffer into memory;
Walk over the image, running scanline filters,
copying scanlines into a new (output) buffer.
Could you have a secondary thread that started as soon as one (or a
small number of) scanline(s) were available, taking advantage of any
shared $L3 cache to grab the data before it is blown away?
Even if the parts, taken in isolation, should be fast:
The image buffers are frequently too large to fit in cache;
Cache misses tend to make PNG decoding painfully slow,
even when using faster filters.
If using the Paeth filter though, this adds extra slowness,
due to branch-predictor misses.
On targets like x86,
the filter is frequently implemented using branches;
The branch miss rate is very high.
So, a naive branching version performs like dog crap.
This reminds me of CABAC decoding in h264, where the output of the arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
So, net result: Despite its conceptual simplicity, PNG's decode-time
performance typically sucks.
Contrast, a decoder for a JPEG like format can be made to process one
block at a time and go all the way to final output. So, JPEG is often
faster despite the more complex process (with transform stages and a
colorspace transform).
The Paeth filter slowness does seem a little odd though:
Theoretically, a CPU could turn a short forward branch into predication;
But, this doesn't tend to be the case.
It then is faster to turn the filter into some convoluted mess of
arithmetic and masking in an attempt to reduce the branch mispredict
costs.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
Terje
My first was a simple BASIC "hello world" program in 1974 on a
Burroughs B5500 (remotely, via again an ASR-33) which we had
for a week in 7th grade math class.
I was quite proud when I managed to factorize 123456789, which
took some time.
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise often suffers from cache misses (cycles lost to cache misses can outweigh the
cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
10/11: Too short, limits effectiveness.
12: OK, leans faster;
13: OK, leans better compression;
14: Intermediate
15: Slower still (Deflate is here)
16: Slower (T.81 JPEG is here)
Where, for 12/13 bits, the fastest strategy is typically to use a single
big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
In this case, the relative slowness of falling back to a linear search
being less than that of the cost of the L1 misses from a bigger lookup
table.
The relative loss of Huffman coding efficiency between a 13 bit limit
and 15 bit limit is fairly modest.
Yeah.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
#if 0 //naive version, pays a lot for branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
int p, pa, pb, pc;
p=a+b-c;
pa=(p>a)?(p-a):(a-p);
pb=(p>b)?(p-b):(b-p);
pc=(p>c)?(p-c):(c-p);
p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
return(p);
}
#endif
#if 1 //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
int p, pa, pb, pc;
int ma, mb, mc;
p=a+b-c;
pa=p-a; pb=p-b; pc=p-c;
ma=pa>>31; mb=pb>>31; mc=pc>>31;
pa=pa^ma; pb=pb^mb; pc=pc^mc;
ma=pb-pa; mb=pc-pb; mc=pc-pa;
ma=ma>>31; mb=mb>>31; mc=mc>>31;
p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
return(p);
}
#endif
Where, the Paeth filter is typically the most heavily used filter in PNG decoding (because it tends to be the most accurate), but also the slowest.
Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
Scott Lurndal <[email protected]> schrieb:
My first was a simple BASIC "hello world" program in 1974 on a
Burroughs B5500 (remotely, via again an ASR-33) which we had
for a week in 7th grade math class.
I started out on my father's first programmable pocket calculator,
a Casio model with 38 steps (I think).
I was quite proud when I managed to factorize 123456789, which
took some time.
On 01/03/2026 13:18, Thomas Koenig wrote:
Scott Lurndal <[email protected]> schrieb:
My first was a simple BASIC "hello world" program in 1974 on a
Burroughs B5500 (remotely, via again an ASR-33) which we had
for a week in 7th grade math class.
I started out on my father's first programmable pocket calculator,
a Casio model with 38 steps (I think).
Would that have been a Casio fx-3600P? I bought one of these as a teenager, and used it non-stop. 38 steps of program space was not a
lot, but I remember making a library for complex number calculations for it.
I was quite proud when I managed to factorize 123456789, which
took some time.
I used mine to find formulas for numerical integration (like Simpson's
rule, but higher order). Basically useless, but fun!
Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
That did not solve memory latency, but that's a problem even for OoO
cores.
I suspect a big part of the problem was tension between Intel and HP
where the only political solution was allowing the architects from both
sides to "dump in" their favorite ideas. A recipe for disaster.
The HP side had people like Bob Rau (Cydrome) and Josh Fisher
(Multiflow), and given their premise, the architecture is ok; somewhat
on the complex side, but they wanted to cover all the good ideas from
earlier designs; after all, it was to be the one architecture to rule
them all (especially performancewise). You cannot leave out a feature
that a competitor could then add to outperform IA-64.
The major problem was that the premise was wrong. They assumed that
in-order would give them a clock rate edge, but that was not the case,
right from the start (The 1GHz Itanium II (released July 2002)
competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
XP (released June 2002)). They also assumed that explicit parallelism
would provide at least as much ILP as hardware scheduling of OoO CPUs,
but that was not the case for general-purpose code, and in any case,
they needed a lot of additional ILP to make up for their clock speed disadvantage.
On 2/27/2026 1:52 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
The one you mentioned in your last paragraph, specifically,
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand, doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to increase the MLP.
No. The memory system throughput depends upon the access pattern. It
is easier/lower cost to increase the throughput for sequential accesses
than random (think wider interfaces, cache blocks larger than the amount accessed, etc.)
But optimization for sequential workloads can actually
hurt performance for random workloads, e.g. larger block sizes reduce
the number of accesses for sequential workloads, but each access takes longer, thus hurting random workloads. So
designers aim to maximize the throughput, subject to cost and technology constraints, for some mix of sequential (bandwidth) versus random (latency) access.
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
10/11: Too short, limits effectiveness.
12: OK, leans faster;
13: OK, leans better compression;
14: Intermediate
15: Slower still (Deflate is here)
16: Slower (T.81 JPEG is here)
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and
slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
When you have many really short codes (think Morse!), then you can profitably decode multiple in a single iteration.
In this case, the relative slowness of falling back to a linear search
being less than that of the cost of the L1 misses from a bigger lookup
table.
The relative loss of Huffman coding efficiency between a 13 bit limit
and 15 bit limit is fairly modest.
Yeah.
I would look for a way to handle multiple pixels at once, with SIMD
code: There the masking/combining is typically the easiest way to
implement short branches.
(I might take a look a png decoding at some point)
#if 0 //naive version, pays a lot for branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
int p, pa, pb, pc;
p=a+b-c;
pa=(p>a)?(p-a):(a-p);
pb=(p>b)?(p-b):(b-p);
pc=(p>c)?(p-c):(c-p);
p=(pa<=pb)?((pa<=pc)?a:c):((pb<=pc)?b:c);
return(p);
}
#endif
#if 1 //avoid branch penalties
int BGBBTJ_BufPNG_Paeth(int a, int b, int c)
{
int p, pa, pb, pc;
int ma, mb, mc;
p=a+b-c;
pa=p-a; pb=p-b; pc=p-c;
ma=pa>>31; mb=pb>>31; mc=pc>>31;
pa=pa^ma; pb=pb^mb; pc=pc^mc;
ma=pb-pa; mb=pc-pb; mc=pc-pa;
ma=ma>>31; mb=mb>>31; mc=mc>>31;
p=(ma&((mb&c)|((~mb)&b))) | ((~ma)&((mc&c)|((~mc)&a)));
return(p);
}
#endif
Where, the Paeth filter is typically the most heavily used filter in
PNG decoding (because it tends to be the most accurate), but also the
slowest.
Could in theory be SIMD'ed to maybe work on RGB or RGBA in parallel.
OK
On 3/1/2026 12:02 PM, Terje Mathisen wrote:
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to
make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one
of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much its compression advantage.
Say, max symbol length:
10/11: Too short, limits effectiveness.
12: OK, leans faster;
13: OK, leans better compression;
14: Intermediate
15: Slower still (Deflate is here)
16: Slower (T.81 JPEG is here)
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and
slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
Can work OK if one assumes all of the longer codes are prefixed by a
longish series of 1s, which is usually but not necessarily true.
Stephen Fuld <[email protected]d> wrote:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited. Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance. To me, that is the definition
of being latency bound, not bandwidth bound.
I agree with your definition, but my prediction is somewhat different.
First, consider a silly program that goes sequentially over a larger
array accessing all lines. AFAICS you should see a tiny effect when
the program uses only one byte from each line compared to using the whole
line. Now consider a variant that accesses every fifth line.
There are differences. One is that the prefetcher needs to realize that
there is no need to prefetch intermediate lines. The second difference
is that one can fetch lines quickly only when they are on a single
page. Having "step 5" on lines means 5 times as many page crossings.
I do not know how big pages are in modern DRAM, but at a large
enough step you will see significant delay due to page crossings. I
would tend to call this delay "latency", but it is somewhat
murky. Namely, with enough prefetch and enough memory banks
you can still saturate a single channel to the core (assuming
that there are many cores, many channels from memory controller
to memory banks but only a single channel between memory controller
and each core). Of course, modern systems tend to have a limited
number of memory banks, so the argument above is purely theoretical.
Somewhat different case is when there are independent loads from
random locations, something like
for(i = 0; i < N; i++) {
s += m[f(i)];
}
where 'f' is very cheap to compute, but hard to predict by the
hardware. In case above reorder buffer and multiple banks helps,
but even with unlimited CPU resources the maximal number of accesses
is the number of memory banks divided by the access time of a single bank
(that is essentially the latency of the memory array).
Then there is pointer chasing case, like
for(i = 0; i < N; i++) {
j = m[j];
}
when 'm' is filled with a semi-random cyclic pattern this behaves
quite badly; basically you can start the next access only when you
have the result of the previous access. In practice, a large 'm'
seems to produce a large number of cache misses for TLB entries.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster
or spin the disk faster. You want higher bandwidth? Put more bits on a
track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
That disk analogy is flawed. AFAIK there is no penalty for choosing
"far away" pages compared to "near" ones (if anything the opposite: Row
Hammer shows that accesses to "near" pages mean that a given page may
need more frequent refresh). In the case of memory, time spent in the
memory controller is non-negligible, at least for accesses within a
single page. AFAIK random access to lines within a single page
costs no more than sequential access; for disk you want sequential
access to a single track.
On 2/28/2026 1:49 PM, Waldek Hebisch wrote:
Stephen Fuld <[email protected]d> wrote:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Somewhat different case is when there are independent loads from
random locations, something like
for(i = 0; i < N; i++) {
s += m[f(i)];
}
where 'f' is very cheap to compute, but hard to predict by the
hardware. In case above reorder buffer and multiple banks helps,
but even with unlimited CPU resurces maximal number of accesses
is number of memory banks divided by access time of single bank
(that is essentialy latency of memory array).
Then there is pointer chasing case, like
for(i = 0; i < N; i++) {
j = m[j];
}
when 'm' is filled with semi-random cyclic pattern this behaves
quite badly, basically you can start next access only when you
have result of the previous access. In practice, large 'm'
seem to produce large number of cache misses for TLB entries.
Well, the examples I gave got confusing (my fault) because, as Anton
pointed out, the table I used would fit into L3 cache on many modern systems. So this all got tied up in the difference between in cache
results and in DRAM results. I don't disagree with your points, but
they are tangential to the point I was trying to make.
Let me try again. Suppose you had a (totally silly) program with a 2 GB array, and you used a random number to generate an address within it, then added the value at that addressed byte to an accumulator. Repeat say
10,000 times. I would call this program latency bound, but I suspect
Anton would call it bandwidth bound. If that is true, then that
explains the original differences Anton and I had.
Perhaps this distinction is clearer to me due to my background in the
(hard) disk business. You want lower latency? Make the arm move faster or spin the disk faster. You want higher bandwidth? Put more bits on a
track or interleave the data across multiple disk heads. And in a
system, the number of active prefetches is naturally limited by the
number of disk arms you have.
That disk analogy is flawed. AFAIK there is no penalty for choosing
"far away" pages compared to "near" ones (if anything the opposite: Row
Hammer shows that accesses to "near" pages mean that a given page may
need more frequent refresh). In the case of memory, time spent in the
memory controller is non-negligible, at least for accesses within a
single page. AFAIK random access to lines within a single page
costs no more than sequential access; for disk you want sequential
access to a single track.
Yes, but these are second order compared to the difference between
latency and bandwidth on a disk.
This whole thing has spiraled into something far beyond what I expected,
and what, I think, is useful. So unless you want me to, I probably
won't respond further.
BGB <[email protected]> posted:
On 3/1/2026 12:02 PM, Terje Mathisen wrote:
BGB wrote:
On 2/28/2026 9:48 AM, Terje Mathisen wrote:
This reminds me of CABAC decoding in h264, where the output of the
arithmetic decoder is single bits that by definition cannot be
predictable, but the codec typically uses that bit to branch.
Yeah.
Making arithmetic and range coders fast is also hard.
I don't often use them as much because I am not aware of a good way to make them fast.
This is part of why I had often ended up going for STF+AdRice or
similar, which, while not the best in terms of compression, can be one of the faster options in many cases.
Theoretically, table-driven Huffman could be faster, but likewise
often suffers from cache misses (cycles lost to cache misses can
outweigh the cost of the more complex logic of an AdRice coder).
Huffman speed can be improved by reducing maximum symbol length and
table size, but then this can lose much of its compression advantage.
Say, max symbol length:
10/11: Too short, limits effectiveness.
12: OK, leans faster;
13: OK, leans better compression;
14: Intermediate
15: Slower still (Deflate is here)
16: Slower (T.81 JPEG is here)
Where, for 12/13 bits, the fastest strategy is typically to use a
single big lookup table for the entropy decoder.
For 15 or 16 bits, it is often faster to have a separate fast-path and slow path. Say, fast path matches on the first 9 or 10 bits, and then
the slow path falls back to a linear search (over the longer symbols).
I have looked at multi-level table lookups, where the symbol either is
the one you want (short codes) or an index into a list of secondary
tables to be used on the remaining bits.
Can work OK if one assumes all of the longer codes are prefixed by a
longish series of 1s, which is usually but not necessarily true.
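As a sketch of the single-big-table strategy described above: each table entry packs the decoded symbol and its code length, so decoding a symbol is one lookup on the low TBL_BITS bits of the bit buffer plus a shift. The 12-bit table size matches the "leans faster" point in the list; the toy LSB-first code and all names here are my own assumptions, not from the post.

```python
TBL_BITS = 12  # single big table, 2**12 entries

def build_toy_table():
    """Toy LSB-first prefix code: '0' -> A (len 1), '10' -> B, '11' -> C."""
    table = [None] * (1 << TBL_BITS)
    for i in range(1 << TBL_BITS):
        if i & 1 == 0:
            table[i] = ("A", 1)   # first bit 0
        elif i & 2 == 0:
            table[i] = ("B", 2)   # bits 1,0
        else:
            table[i] = ("C", 2)   # bits 1,1
    return table

def decode(table, data, nsyms):
    """Decode nsyms symbols: refill a bit buffer, then one table lookup
    per symbol -- no per-bit branching, which is what makes it fast."""
    bitbuf, bitcnt, pos, out = 0, 0, 0, []
    for _ in range(nsyms):
        while bitcnt < TBL_BITS and pos < len(data):  # refill
            bitbuf |= data[pos] << bitcnt
            bitcnt += 8
            pos += 1
        sym, length = table[bitbuf & ((1 << TBL_BITS) - 1)]
        bitbuf >>= length
        bitcnt -= length
        out.append(sym)
    return out
```

The cache-miss concern in the post is exactly the `table` access: at 12 bits the table is small enough to stay resident, at 15-16 bits it often is not, hence the fast-path/slow-path split.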
In HW any pattern can be used. In SW only patterns that are almost
satisfied by the current ISA can be considered. Big difference.
In article <[email protected]>,
Anton Ertl <[email protected]> wrote:
Stefan Monnier <[email protected]> writes:
At the time of conception, there were many arguments that {sooner or
later} compilers COULD figure stuff like this out.
I can't remember seeing such arguments coming from compiler people, tho.
Actually, the IA-64 people could point to the work on VLIW (in
particular, Multiflow (trace scheduling) and Cydrome (software
pipelining)), which in turn is based on the work on compilers for
microcode.
That did not solve memory latency, but that's a problem even for OoO
cores.
I suspect a big part of the problem was tension between Intel and HP
where the only political solution was allowing the architects from both
sides to "dump in" their favorite ideas. A recipe for disaster.
The HP side had people like Bob Rau (Cydrome) and Josh Fisher
(Multiflow), and given their premise, the architecture is ok; somewhat
on the complex side, but they wanted to cover all the good ideas from
earlier designs; after all, it was to be the one architecture to rule
them all (especially performancewise). You cannot leave out a feature
that a competitor could then add to outperform IA-64.
The major problem was that the premise was wrong. They assumed that
in-order would give them a clock rate edge, but that was not the case,
right from the start (The 1GHz Itanium II (released July 2002)
competed with 2.53GHz Pentium 4 (released May 2002) and 1800MHz Athlon
XP (released June 2002)). They also assumed that explicit parallelism
would provide at least as much ILP as hardware scheduling of OoO CPUs,
but that was not the case for general-purpose code, and in any case,
they needed a lot of additional ILP to make up for their clock speed
disadvantage.
As I've said before: I worked at HP during IA64, and it was not driven
by technical issues, but rather political/financial issues.
On HP's side, IA64 was driven by HP Labs, which was an independent group doing technical investigations without any clear line to products. They
had to "sell" their ideas to the HP development groups, who could ignore them.
They managed to get some upper level HP managers interested in IA64,
and took that directly to Intel. The HP internal development groups (the ones making CPUs and server/workstation chipsets) did almost nothing with IA64 until after Intel announced the IA64 agreement.
IA64 was called PrecisionArchitecture-WideWord (PA-WW) by HP Labs as a
follow on to PA-RISC. The initial version of PA-WW had no register
interlocks whatsoever; code had to be written to know the L1 and L2
cache latency, and not touch the result registers too soon. This was
laughed out of the room, and they came back with interlocks in the next iteration. This happened in 1993-1994, which was before the Out-of-Order RISCs came to market (but they were in development in HP and Intel), so the IA64 decisions were being made in the time window before folks really got to see what OoO could do.
Also on HP's side, we had our own fab, which was having trouble keeping up with the rest of the industry. Designers felt performance was not predictable, and the fab's costs were escalating. The fab was going to
have trouble getting to 180nm and beyond. So HP wanted access to Intel's fabs, and that was part of the IA64 deal--we could make PA-RISC chips on Intel's fabs for a long time.
On Intel's side, Intel was divided very strongly geographically. At the time, Hillsboro was "winning" in the x86 CPU area, and Santa Clara was
on the outs (I think they did 860 and other failures like that). So
when Santa Clara heard of IA64, they jumped on the opportunity--a way to
leap past Hillsboro. IA64 solved the AMD problem--with all new IA64
patents, AMD couldn't clone it like x86, so management was interested. Technically, IA64 just had to be "as good as" x86, to make it worth
while to jump to a new architecture which removes their competitor. I
can see how even smart folks could get sucked in to thinking
"architecture doesn't matter, and this new one prevents clones, so we
should do it to eventually make more money".
Both companies had selfish mid-level managers who saw a way to pad their resumes to leap to VP of engineering almost anywhere else. And they were right--on HP's side, I think every manager involved moved to a promotion
at another company just before Merced came out. So IA64 was not going to
get canceled--the managers didn't want to admit they were wrong.
Both companies also saw IA64 as a way to kill off the RISC competitors.
And on this point, they were right, IA64 did kill the RISC minicomputer market.
The technical merits of IA64 don't make the top 5 in the list of reasons to do IA64 for either company.
But HP using Intel's fabs didn't work out well. HP's first CPU on
Intel's fabs was the 360MHz PA-8500. This was a disappointing step up
from the 240MHz PA-8200 (which was partly speed limited by external L1
cache memory, running at 4ns, and the 8500 moved to on-chip L1 cache
memory, removing that limit). It turned out Intel's fab advantage was consistency and yield, not speed, and so it would take tuning to get the speed up. Intel did this tuning with large teams, and this was not easy
for HP to replicate. And by this time, IBM was marketing a 180nm copper
wire SOI process which WAS much faster (and yields weren't a concern for
HP), so after getting the PA-8500 up to 550MHz after a lot of work, HP
jumped to IBM as a fab, and the speeds went up to 750MHz and then 875MHz with some light tuning (and a lot less work).
Everyone technically minded knew IA64 was technically not that great, but both companies had their reasons to do it anyway.
Kent
Stephen Fuld <[email protected]d> posted:
On 2/27/2026 1:52 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
The one you mentioned in your last paragraph, specifically,
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to
increase the MLP.
No. The memory system throughput depends upon the access pattern. It
is easier/lower cost to increase the throughput for sequential accesses
than random (think wider interfaces, cache blocks larger than the amount
accessed, etc.)
It depends on the number of accesses and the ability to absorb the latency.
For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.
A page (4096B) at 5GB/s needs 800ns±
So, we need 12.5 Banks to saturate the data channel
And we have 16 busses each with 16 banks,
we can sustain 80GB/s
So, we need in excess of 200 outstanding requests
Each able to absorb slightly more than 10µs.
But this is more like a disk/flash system than main memory.
Once each bank (of which there are 256) has more than a 3-deep queue
the BW can be delivered as long as the requests have almost ANY order
other than targeting 1 (or few) bank(s).
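The figures in the post above hang together under Little's law; the following back-of-envelope check is my own arithmetic on the post's assumed numbers (4096B pages, 5GB/s per bus, 10µs latency, 16 busses):

```python
page_bytes = 4096
bank_bw    = 5e9        # 5 GB/s while streaming a page
latency    = 10e-6      # 10 µs access latency

# Time to stream one page: ~819 ns (the post rounds to 800ns).
transfer = page_bytes / bank_bw

# Banks needed to keep one 5 GB/s channel busy while hiding the
# latency: ~12 (the post's 12.5 uses the rounded 800 ns figure).
banks_per_channel = latency / transfer

# 16 busses at 5 GB/s each gives the stated 80 GB/s.
total_bw = 16 * bank_bw

# Little's law: requests in flight = throughput x latency,
# about 195 pages, i.e. "in excess of 200" once queueing slack
# (the 3-deep per-bank queues) is added.
outstanding = total_bw * latency / page_bytes
```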
But what you say is TRUE when one limits the interpretation to modern
CPUs, but not when one limits oneself to applications running on
modern CPUs requesting pages from long term storage. {how do you think
Data Bases work??}
But optimization for sequential workloads can actually
hurt performance for random workloads, e.g. larger block sizes reduce
the number of accesses for sequential workloads, but each access takes
longer, thus hurting random workloads. So
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
On 3/1/2026 1:13 PM, MitchAlsup wrote:
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
Stephen Fuld <[email protected]d> writes:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
Let me guess: Your employer was purchased about five years
later by a former Secretary of the Treasury :-).
FWIW, about that same time, there were third-party
RAM-based disk units available for the systems
that many of the big-B customers were using. Not inexpensive,
but performed well (if still limited by disk controller
and host I/O bus bandwidth (1MB/s and 8MB/s respectively
on the medium systems line).
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 2/27/2026 1:52 AM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
On 2/24/2026 11:33 PM, Anton Ertl wrote:
Stephen Fuld <[email protected]d> writes:
Let me better explain what I was trying to set up, then you can tell me
where I went wrong. I did expect the records to be sequential, and
could be pre-fetched, but with the inner loop so short, just a few
instructions, I thought that it would quickly "get ahead" of the
prefetch. That is, that there was a small limit on the number of
prefetches that could be in process simultaneously, and with such a
small CPU loop, it would quickly hit that limit, and thus be latency bound.
I think that it's bandwidth-bound, because none of the memory (or
outer-level cache) accesses depend on the results of previous ones; so
the loads can be started right away, up to the limit of memory-level
parallelism of the hardware. If the records are in RAM, the hardware
prefetcher can help to avoid running into the scheduler and ROB limits
of the OoO engine.
I think our difference may be just terminology rather than substance.
To me, it is precisely the limit you mentioned that makes it latency
rather than bandwidth limited.
I mentioned several limits. Which one do you have in mind?
The one you mentioned in your last paragraph, specifically,
the limit of memory-level parallelism of the hardware.
Think of it this way. In the current
situation, increasing the memory system bandwidth, say by hypothetically
increasing the number of memory banks, having a wider interface between
the memory and the core, etc., all traditional methods for increasing
memory bandwidth, would not improve the performance. On the other hand,
doing things to reduce the memory latency (say hypothetically a faster
ram cell), would improve the performance.
If the CPU is designed to provide enough memory-level parallelism to
make use of the bandwidth (and that is likely, otherwise why provide
that much bandwidth), then once the designers spend money on
increasing the bandwidth, they will also spend the money necessary to
increase the MLP.
No. The memory system throughput depends upon the access pattern. It
is easier/lower cost to increase the throughput for sequential accesses
than random (think wider interfaces, cache blocks larger than the amount
accessed, etc.)
It depends on the number of accesses and the ability to absorb the latency.
For example, say we have a memory system with 256 banks, and a latency of 10µs, and each access is for a page of memory at 5GB/s.
A page (4096B) at 5GB/s needs 800ns±
So, we need 12.5 Banks to saturate the data channel
And we have 16 busses each with 16 banks,
we can sustain 80GB/s
So, we need in excess of 200 outstanding requests
Each able to absorb slightly more than 10µs.
But this is more like a disk/flash system than main memory.
Yes. In particular, requests to main memory are typically the size of a cache block, not a page. That changes the calculations above.
Once each bank (of which there are 256) has more than a 3-deep queue
the BW can be delivered as long as the requests have almost ANY order
other than targeting 1 (or few) bank(s).
So you are saying that the system is bandwidth limited as long as the
CPU can sustain 768 (3 * 256) simultaneous prefetches in progress. OK :-)
But what you say is TRUE when one limits the interpretation to modern
CPUs, but not when one limits oneself to applications running on
modern CPUs requesting pages from long term storage. {how do you think
Data Bases work??}
OK. BTW, I have done database "stuff" since CODASYL database systems in
the 1970s through relational systems in the 1980s. But page-sized
accesses to external storage weren't what we were talking about.
But optimization for sequential workloads can actually
hurt performance for random workloads, e.g. larger block sizes reduce
the number of accesses for sequential workloads, but each access takes
longer, thus hurting random workloads. So
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
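Those SSD figures are mutually consistent; checking them with Little's law (my arithmetic, using the numbers as quoted):

```python
iops    = 3.3e6      # 4096 B random reads per second
latency = 70e-6      # 70 µs access delay per read

# Sustained payload bandwidth: ~13.5 GB/s, in the ballpark of the
# quoted 13.2 GB/s once link/protocol overhead is accounted for.
payload_bw = iops * 4096

# Little's law: to sustain that rate the drive must keep
# iops x latency = 231 reads in flight at once.
in_flight = iops * latency
```

The ~231 concurrent reads is the SSD analogue of the 3-deep-per-bank queues in the DRAM example: high random throughput is bought with deep queues, not low latency.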
Technically, IA64 just had to be "as good as" x86, to make it worth
while to jump to a new architecture which removes their competitor.
I can see how even smart folks could get sucked in to thinking
"architecture doesn't matter, and this new one prevents clones, so
we should do it to eventually make more money".
Everyone technically minded knew IA64 was technically not that
great, but both companies had their reasons to do it anyway.
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
Wow! But I think you will agree that design is unlikely to be used for
"mass market" hundreds of terabyte systems used for commercial database
systems, etc.
Let me try again. Suppose you had a (totally silly) program with a 2 GB
array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program latency
bound, but I suspect Anton would call it bandwidth bound. If that is
true, then that explains the original differences Anton and I had.
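The thought experiment can be written down directly. The key property is that no access depends on the result of a previous one; this sketch shrinks the array so it actually runs, and all names are mine:

```python
import random

def random_byte_sum(arr, n_accesses, seed=42):
    """Sum n_accesses randomly addressed bytes from arr.  Each load is
    independent of the last, so an OoO core (or a memory system with
    enough parallelism) may in principle overlap all the misses."""
    rng = random.Random(seed)
    acc = 0
    for _ in range(n_accesses):
        acc += arr[rng.randrange(len(arr))]
    return acc

# With a constant-valued array the result is exact regardless of which
# addresses are drawn: 10,000 accesses of byte value 3 sum to 30,000.
demo = random_byte_sum(bytes([3]) * (1 << 20), 10_000)
```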
Let me try again. Suppose you had a (totally silly) program with a 2 GB
array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program latency
bound, but I suspect Anton would call it bandwidth bound. If that is
true, then that explains the original differences Anton and I had.
I think in theory, this is not latency bound: assuming enough CPU and
memory parallelism in the implementation, it can be arbitrarily fast.
But in practice it will probably be significantly slower than if you
were to do a sequential traversal.
Indeed, in practice you may sometimes see the performance be correlated
with your memory latency, but if so it's only because your hardware
doesn't offer enough parallelism (e.g. not enough memory banks).
AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help speed
it up (typically because of pointer-chasing). This is important,
because it's a *lot* more difficult to reduce memory latency than it is
to add bandwidth or parallelism.
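The pointer-chasing case described above is structurally different from the random-sum program: the next address is the previous load's result, so the misses serialize no matter how much bandwidth or bank parallelism is available. A toy sketch (my own example):

```python
def pointer_chase(next_idx, start, steps):
    """Follow a linked structure: load i, then load next_idx[i], ...
    Each address depends on the previous load's *value*, so extra
    bandwidth or memory banks cannot hide the per-load latency --
    this is the pattern usually meant by 'latency-bound'."""
    i = start
    for _ in range(steps):
        i = next_idx[i]
    return i

# A 4-element ring: 0 -> 1 -> 2 -> 3 -> 0 -> ...
ring = [1, 2, 3, 0]
```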
Stephen Fuld <[email protected]d> writes:
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
Wow! But I think you will agree that design is unlikely to be used for
"mass market" hundreds of terabyte systems used for commercial database
systems, etc.
I think you'll find that the commercial databases are dominated
by high-end NVMe (PCI based SSDs) for working storage, with
spinning rust as archive storage.
For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.
https://techatlantix.com/mzwmo61thclf-00aw7.html
On 3/3/2026 1:55 PM, Scott Lurndal wrote:
Stephen Fuld <[email protected]d> writes:
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
Wow! But I think you will agree that design is unlikely to be used for
"mass market" hundreds of terabyte systems used for commercial database
systems, etc.
I think you'll find that the commercial databases are dominated
by high-end NVMe (PCI based SSDs) for working storage, with
spinning rust as archive storage.
For example, a 61TB NVMe PCIe gen5 SSD for USD7,700.00.
https://techatlantix.com/mzwmo61thclf-00aw7.html
I freely admit that I am "out of the loop" for modern systems. However,
this system costs about $125/TB. A quick check of Amazon shows typical
hard disk prices at about $25/TB. Are you saying that typical current
systems are paying about 5 times the price, and way greater than the
cost of the CPU for such systems? While obviously there is a market for
such systems, it is hard for me to believe that the typical "enterprise"
customer would be that market. Amazing!
On 3/3/2026 1:55 PM, Scott Lurndal wrote:
Stephen Fuld <[email protected]d> writes:
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that it true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's
disk controller, making it the industry's first true cache disk
controller. This was almost all about reducing latency (from
tens of milliseconds on a non cache controller to hundreds of
microseconds on a cache hit). There was a small improvement in
transfer rate, but the latency reduction dominated the
improvement.
There is an SSD that can perform 3,300,000×4096B random read
transfers per second on a PCIe 5.0-×4 connector. That is 13.2GB/s
over the PCIe link which is BW limited to 15.x GB/s. Each "RAS"
has a 70µs access delay.
Wow! But I think you will agree that design is unlikely to be
used for "mass market" hundreds of terabyte systems used for
commercial database systems, etc.
I think you'll find that the commercial databases are dominated
by high-end NVMe (PCI based SSDs) for working storage, with
spinning rust as archive storage.
For example, a 61TB MVME PCIe gen5 SSD for USD7,700.00.
https://techatlantix.com/mzwmo61thclf-00aw7.html
I freely admit that I am "out of the loop" for modern systems.
However, this system costs about $125/TB. A quick check of Amazon
shows typical hard disk prices at about $25 /TB.
Are you saying that
typical current systems are paying about 5 times the price, and way
greater than the cost of the CPU for such systems? While obviously
there is a market for such systems, it is hard for me to believe that
the typical "enterprise" customer would be that market. Amazing!
That's not a fair comparison.
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound.
If that is true, then that explains the original differences Anton
and I had.
I think in theory, this is not latency bound: assuming enough CPU
and memory parallelism in the implementation, it can be arbitrarily
fast. But in practice it will probably be significantly slower than
if you were to do a sequential traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've set up huge pages.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
will be horrible!
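Terje's gap estimate checks out with simple arithmetic on the post's numbers:

```python
array_bytes = 2e9     # 2 GB array
accesses    = 10e3    # 10K random accesses

# Average spacing between touched bytes: 200 KB, far beyond a cache
# line or a 4 KB page, so essentially every access misses all cache
# levels and (without huge pages) the TLB as well.
avg_gap = array_bytes / accesses
```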
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that adding parallelism and/or bandwidth to your memory hierarchy won't help
speed it up (typically because of pointer-chasing). This is
important, because it's a *lot* more difficult to reduce memory
latency than it is to add bandwidth or parallelism.
When the working set does not allow any cache re-use, then a classic
Cray could perform much better than a modern OoO cpu.
Terje
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound.
If that is true, then that explains the original differences Anton
and I had.
I think in theory, this is not latency bound: assuming enough CPU
and memory parallelism in the implementation, it can be arbitrarily
fast. But in practice it will probably be significantly slower than
if you were to do a sequential traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've set up huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
will be horrible!
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that adding
parallelism and/or bandwidth to your memory hierarchy won't help
speed it up (typically because of pointer-chasing). This is
important, because it's a *lot* more difficult to reduce memory
latency than it is to add bandwidth or parallelism.
When the working set does not allow any cache re-use, then a classic
Cray could perform much better than a modern OoO cpu.
Terje
When the working set does not allow any cache re-use then it does not fit
in a classic Cray's main memory.
Besides, it is nearly impossible to create code that does something
useful and has no cache hits at all. At the very least, there will be
reuse on the instruction side. But I think that in order to completely
avoid reuse on the data side you'll have to do something unrealistic.
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector. That is 13.2GB/s over the PCIe
link which is BW limited to 15.x GB/s. Each "RAS" has a 70µs access delay.
Wow! But I think you will agree that design is unlikely to be used for
"mass market" hundreds of terabyte systems used for commercial database
systems, etc. (at least for a while). Do you know what its cost per
GB is? SSDs certainly solve the multi-millisecond access time problem of
hard disks, but at a high cost. I think that hard disk sales are not
going away for at least a while. :-)
Michael S wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound.
If that is true, then that explains the original differences
Anton and I had.
I think in theory, this is not latency bound: assuming enough CPU
and memory parallelism in the implementation, it can be
arbitrarily fast. But in practice it will probably be
significantly slower than if you were to do a sequential
traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
will be horrible!
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that
adding parallelism and/or bandwidth to your memory hierarchy
won't help speed it up (typically because of pointer-chasing).
This is important, because it's a *lot* more difficult to reduce
memory latency than it is to add bandwidth or parallelism.
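Stephen's thought experiment can be sketched as below. This is a
hypothetical microbenchmark; the function name, parameters, and the
index trick are mine, not from the thread. The post assumes a 2 GB
array and 10,000 probes, at which point nearly every access misses
every cache level and the TLB.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Sum bytes at random offsets in a large array. len and probes are
   parameters so the idea scales down for testing; the accumulator is
   the only loop-carried dependency, so an OoO core can overlap the
   loads if the memory system supplies enough parallelism. */
uint64_t random_probe_sum(const uint8_t *buf, size_t len,
                          int probes, unsigned seed)
{
    uint64_t acc = 0;
    srand(seed);
    for (int i = 0; i < probes; i++) {
        /* combine two rand() calls so the index can cover 2 GB */
        size_t idx = (((size_t)rand() << 16) ^ (size_t)rand()) % len;
        acc += buf[idx];    /* almost surely a cache + TLB miss at 2 GB */
    }
    return acc;
}
```

Whether this counts as latency-bound or bandwidth-bound is exactly the
question in dispute: each load's latency is huge, but nothing forces
the loads to execute one after another.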
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not
fit in classic Cray's main memory.
The 1985 Cray-2 allowed 2GB, so it is theoretically possible, fitting
the OS+program into the 73 MB gap between 2GiB and 2E9.
Easily done on a later Cray-Y-MP.
Besides, it is nearly impossible to create a code that does
something useful and has no cache hits at all. At very least, there
will be reuse on instruction side. But I think that in order to
completely avoid reuse on the data side you'll have to do something non-realistic.
I think the current "gedankenexperiment" is way beyond "something
useful". :-)
Terje
On Wed, 4 Mar 2026 19:38:08 +0100
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound.
If that is true, then that explains the original differences
Anton and I had.
I think in theory, this is not latency bound: assuming enough CPU
and memory parallelism in the implementation, it can be
arbitrarily fast. But in practice it will probably be
significantly slower than if you were to do a sequential
traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
will be horrible!
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that
adding parallelism and/or bandwidth to your memory hierarchy
won't help speed it up (typically because of pointer-chasing).
This is important, because it's a *lot* more difficult to reduce
memory latency than it is to add bandwidth or parallelism.
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not
fit in classic Cray's main memory.
The 1985 Cray-2 allowed 2GB, so theoretically possible with the
OS+program into the 73 MB gap between 2GiB and 2E9.
Easily done on a later Cray-Y-MP.
I had the Cray-1 in mind.
Cray-2 memory was big enough, but was it fast enough latency-wise?
All the info I see about Cray-2 memory praises its great capacity and
bandwidth, but says nothing about latency.
It seems the latency was huge and the whole system was useful only due
to a mechanism that today we would call hardware prefetch. Or software
prefetch? I am not sure.
But it is possible that I misunderstood.
Besides, it is nearly impossible to create a code that does
something useful and has no cache hits at all. At very least, there
will be reuse on instruction side. But I think that in order to
completely avoid reuse on the data side you'll have to do something
non-realistic.
I think the current "gedankenexperiment" is way beyond "something
useful". :-)
Agreed.
Starting from the latest post of Stephen Fuld (2026-03-03 03:02) we
left the realm of "useful" for good.
On Wed, 4 Mar 2026 07:44:39 -0800
Stephen Fuld <[email protected]d> wrote:
On 3/3/2026 1:55 PM, Scott Lurndal wrote:
Stephen Fuld <[email protected]d> writes: =20=20
On 3/3/2026 11:01 AM, MitchAlsup wrote: =20=20
Stephen Fuld <[email protected]d> posted:
=20
On 3/1/2026 1:13 PM, MitchAlsup wrote: =20
Stephen Fuld <[email protected]d> posted: =20
snip
=20
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points. =20
Perhaps that it true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's
disk controller, making it the industry's first true cache disk
controller. This was almost all about reducing latency (from
tens of milliseconds on a non cache controller to hundreds of
microseconds on a cache hit). There was a small improvement in
transfer rate, but the latency reduction dominated the
improvement. =20
There is an SSD that can perform 3,300,000=D74096B random read
transfers per second on a PCIe 5.0-=D74 connector. That is 13.2GB/s
over the PCIe link which is BW limited to 15.x GB/s. Each "RAS"
has a 70=B5s access delay. =20
Wow! But I think you will agree that design is unlikely to be
used for "mass market" hundreds of terabyte systems used for
commercial database systems, etc. =20
I think you'll find that the commercial databases are dominated
by high-end NVMe (PCI based SSDs) for working storage, with
spinning rust as archive storage.
=20
For example, a 61TB MVME PCIe gen5 SSD for USD7,700.00.
=20
https://techatlantix.com/mzwmo61thclf-00aw7.html =20
I freely admit that I am "out of the loop" for modern systems.
However, this system costs about $125/TB. A quick check of Amazon
shows typical hard disk prices at about $25/TB.
That's not a fair comparison.
The $25/TB you see on Amazon is likely a 5400 rpm 4TB disk.
So, at 61 TB you only have 15 spindles. That means that even in ideal
conditions, with accesses distributed evenly across all disks, your
sequential read bandwidth is no better than ~1,200 MB/s.
For comparison, the sequential read speed of the BM1743 SSD is 7,500 MB/s.
In order to get comparable bandwidth with HDs you would need
95 spindles. Maybe a little fewer with 7200 rpm. I don't know how many
spindles you would need with the 13000 rpm HDs that enterprises used to
use for databases 20+ years ago. It seems they had better latency than
7200, but about the same bandwidth. Anyway, I'm not even sure that
anybody still makes them.
95 disks alone almost certainly cost more than 8KUSD. And on top of
that you will need a dozen expensive RAID controllers.
So, essentially, when you paid 8KUSD for this SSD, you paid for
bandwidth alone. The 100x improvement in latency you got over the
HD-based solution is a free bonus. Density and power too.
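The spindle arithmetic above can be checked back-of-envelope. The post
implies roughly 80 MB/s sequential per 5400 rpm disk (1,200 MB/s across
15 spindles); the function name and the 80 MB/s figure are my
assumptions, not the poster's exact numbers.

```c
/* How many whole disks are needed to reach a target sequential
   bandwidth, given each disk's sequential rate in MB/s. */
int spindles_for_bandwidth(int target_mb_s, int per_disk_mb_s)
{
    /* ceiling division: you cannot buy a fraction of a disk */
    return (target_mb_s + per_disk_mb_s - 1) / per_disk_mb_s;
}
```

At 80 MB/s per disk this gives 94 spindles to match the SSD's
7,500 MB/s, in line with the post's ~95.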
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
When the working set does not allow any cache re-use, then a classic
Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not fit
in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does something
useful and has no cache hits at all. At very least, there will be
reuse on instruction side. But I think that in order to completely avoid
reuse on the data side you'll have to do something non-realistic.
In article <[email protected]>,
Michael S <[email protected]> wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not fit
in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does something
useful and has no cache hits at all. At very least, there will be
reuse on instruction side. But I think that in order to completely
avoid reuse on the data side you'll have to do something
non-realistic.
There is one very reasonable use case: testing a random number
generator. A useful test is to ensure numbers are uncorrelated, so
you get 3 random numbers called A, B, C, and you look up A*N*N + B*N
+ C to count the number of times you see A followed by B followed by
C, where N is the range of the random value, say, 0 - 1023. This
would be an array of 1 billion 32-bit values. You get 1000 billion
random numbers, and then look through to make sure most buckets have
a value around 1000. Any buckets less than 500 or more than 1500
might be considered a random number generator failure. This is a
useful test since it intuitively makes sense--if some patterns are
too likely (or unlikely), then you know you have a problem with your
"random" numbers.
Another use case would be an algorithm which wants to shuffle a large
array (say, you want to create test cases for a sorting algorithm). I
think most shuffling algorithms which are fair will randomly index
into the array, and each of these will be a cache miss.
Kent
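The shuffle use case can be sketched with a Fisher-Yates shuffle, the
standard fair-shuffle algorithm Kent is most likely alluding to. Each
iteration swaps with a random earlier element, so on a large array
nearly every step touches a cold cache line; rand() here stands in for
whatever PRNG is being exercised.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fair in-place shuffle: element i is swapped with a uniformly random
   element j in [0, i], working down from the end of the array. */
void shuffle(int *a, size_t n)
{
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1); /* random index: likely a miss */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

(For a truly unbiased shuffle the modulo reduction should also be
rejection-sampled, but that detail doesn't change the access pattern.)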
On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
[email protected] (Kent Dickey) wrote:
In article <[email protected]>,
Michael S <[email protected]> wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not fit
in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does something
useful and has no cache hits at all. At very least, there will be
reuse on instruction side. But I think that in order to completely
avoid reuse on the data side you'll have to do something
non-realistic.
There is one very reasonable use case: testing a random number
generator. A useful test is to ensure numbers are uncorrelated, so
you get 3 random numbers called A, B, C, and you look up A*N*N + B*N
+ C to count the number of times you see A followed by B followed by
C, where N is the range of the random value, say, 0 - 1023. This
would be an array of 1 billion 32-bit values. You get 1000 billion
random numbers, and then look through to make sure most buckets have
a value around 1000. Any buckets less than 500 or more than 1500
might be considered a random number generator failure. This is a
useful test since it intuitively makes sense--if some patterns are
too likely (or unlikely), then you know you have a problem with your "random" numbers.
Even if there are no cache hits in accessing the main histogram, there
are still cache hits in the PRNG that you are testing. Unless it is a
very simple PRNG implemented completely in registers.
And even in the case of a very simple PRNG, standard PRNG APIs keep
state in memory, so in order to avoid memory accesses (= cache hits)
one would have to use a non-standard API.
Besides, there are other reasons why a modern big OoO will run rings
around a Cray, either 1 or 2, in that sort of test. Most important are:
1. Compilers are not perfect.
2. Even if compilers were perfect, a Cray has far fewer physical
registers than a big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program with
a 2 GB array, and you used a random number generator to generate an address
within it, then added the value at that addressed byte to an
accumulator. Repeat say 10,000 times. I would call this program
latency bound, but I suspect Anton would call it bandwidth bound.
If that is true, then that explains the original differences Anton
and I had.
I think in theory, this is not latency bound: assuming enough CPU
and memory parallelism in the implementation, it can be arbitrarily
fast. But in practice it will probably be significantly slower than
if you were to do a sequential traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses might
be very significant unless you've setup huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual runtime
will be horrible!
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that adding
parallelism and/or bandwidth to your memory hierarchy won't help
speed it up (typically because of pointer-chasing). This is
important, because it's a *lot* more difficult to reduce memory
latency than it is to add bandwidth or parallelism.
When the working set does not allow any cache re-use, then a classic
Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not fit
in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does something
useful and has no cache hits at all. At very least, there will be
reuse on instruction side. But I think that in order to completely avoid reuse on the data side you'll have to do something non-realistic.
Michael S <[email protected]> posted:But how many clocks would it take to build gather-scatter list that is
On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
[email protected] (Kent Dickey) wrote:
In article <[email protected]>,
Michael S <[email protected]> wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
10K selected from 2G means average distance of 200K, so you
get effectively very close to zero cache hits, and even TLB
misses might be very significant unless you've setup huge
pages.
Relatively horrible.
A human time scale it would still be very fast.
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does
not fit in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does
something useful and has no cache hits at all. At very least,
there will be reuse on instruction side. But I think that in
order to completely avoid reuse on the data side you'll have to
do something non-realistic.
There is one very reasonable use case: testing a random number
generator. A useful test is to ensure numbers are uncorrelated, so
you get 3 random numbers called A, B, C, and you look up A*N*N +
B*N
+ C to count the number of times you see A followed by B followed
by C, where N is the range of the random value, say, 0 - 1023.
This would be an array of 1 billion 32-bit values. You get 1000
billion random numbers, and then look through to make sure most
buckets have a value around 1000. Any buckets less than 500 or
more than 1500 might be considered a random number generator
failure. This is a useful test since it intuitively makes
sense--if some patterns are too likely (or unlikely), then you
know you have a problem with your "random" numbers.
Even if there are no cache hits in access of main histogram, there
are still cache hits in PRNG that you are testing. Unless that is
very simple PRNG completely implemented in registers.
And even in case of very simple PRNG, standard PRNG APIs keep state
in memory, so in order to avoid memory accesses=cache hits one
would have to use non-standard API.
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated one.
The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
Besides, there are other reasons why modern big OoO will run rounds
around Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
register than Big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.
CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
STs of the scatter type. So, 3 instructions would create 192 memory references. And all 192 references could be 'satisfied' in 64-clocks.
I know of no GBOoO machine with 192 outstanding memory references, and
fewer still that can satisfy all 192 MRs in 64 clocks.
The problem was the <ahem> anemic clock rate of ~6ns. Modern CPUs are
30× faster, and "just as wide".
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated one.
The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
Stephen Fuld <[email protected]d> posted:
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's disk
controller, making it the industry's first true cache disk controller.
This was almost all about reducing latency (from tens of milliseconds on
a non cache controller to hundreds of microseconds on a cache hit).
There was a small improvement in transfer rate, but the latency
reduction dominated the improvement.
There is an SSD that can perform 3,300,000×4096B random read transfers
per second on a PCIe 5.0-×4 connector.
Michael S <[email protected]> posted:
Besides, there are other reasons why modern big OoO will run rounds
around Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
register than Big OoO, which would inevitably lead to far fewer memory
accesses running simultaneously.
CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
STs of the scatter type. So, 3 instructions would create 192 memory references. And all 192 references could be 'satisfied' in 64-clocks.
I know of no GBOoO machine with 192 outstanding memory references, and
fewer still that can satisfy all 192 MRs in 64 clocks.
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
Are you aware of any example code on Github?
Andy Valencia [2026-03-05 08:36:24] wrote:
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
Are you aware of any example code on Github?
What does this have to do with Github?
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated
one. The table goes a long way in 'whitening' the RNG.
Are you aware of any example code on Github? I'd be interested in
the implementation details of a decent realization of this technique.
Thank you,
Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
On Wed, 04 Mar 2026 23:46:46 GMT
MitchAlsup <[email protected]d> wrote:
Michael S <[email protected]> posted:
On Wed, 4 Mar 2026 21:07:00 -0000 (UTC)
[email protected] (Kent Dickey) wrote:
In article <[email protected]>,
Michael S <[email protected]> wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
10K selected from 2G means average distance of 200K, so you
get effectively very close to zero cache hits, and even TLB
misses might be very significant unless you've setup huge
pages.
Relatively horrible.
At a human time scale it would still be very fast.
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does
not fit in classic Cray's main memory.
Besides, it is nearly impossible to create a code that does
something useful and has no cache hits at all. At very least,
there will be reuse on instruction side. But I think that in
order to completely avoid reuse on the data side you'll have to
do something non-realistic.
There is one very reasonable use case: testing a random number generator. A useful test is to ensure numbers are uncorrelated, so
you get 3 random numbers called A, B, C, and you look up A*N*N +
B*N
+ C to count the number of times you see A followed by B followed
by C, where N is the range of the random value, say, 0 - 1023.
This would be an array of 1 billion 32-bit values. You get 1000 billion random numbers, and then look through to make sure most
buckets have a value around 1000. Any buckets less than 500 or
more than 1500 might be considered a random number generator
failure. This is a useful test since it intuitively makes
sense--if some patterns are too likely (or unlikely), then you
know you have a problem with your "random" numbers.
Even if there are no cache hits in access of main histogram, there
are still cache hits in PRNG that you are testing. Unless that is
very simple PRNG completely implemented in registers.
And even in case of very simple PRNG, standard PRNG APIs keep state
in memory, so in order to avoid memory accesses=cache hits one
would have to use non-standard API.
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated one.
The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
Besides, there are other reasons why modern big OoO will run rounds around Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect.
2. Even if compilers were perfect, Cray has far fewer physical
register than Big OoO, which would inevitably lead to far fewer
memory accesses running simultaneously.
CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
STs of the scatter type. So, 3 instructions would create 192 memory references. And all 192 references could be 'satisfied' in 64-clocks.
But how many clocks would it take to build a gather-scatter list that
is reused only twice?
I know of no GBOoO machine with 192 outstanding memory references, and fewer still that can satisfy all 192 MRs in 64 clocks.
The problem was the <ahem> anemic clock rate of ~6ns. Modern CPUs are
30× faster, and "just as wide".
I think that even if its clock rate were 5 GHz, the Cray-Y/MP would
still be beaten by [1 core/thread of] a big OoO because of the
above-mentioned bottleneck. Of course, here I assume that a gather
still has a latency of 400ns, which would be 2000 clocks on our
imaginary machine.
If the latency of a gather were cut to 200 ns then I am no longer sure
of the outcome. At 100ns the Cray likely wins vs. 1 big OoO core/thread.
In practice, if quickness is of major importance, the task could be
partially parallelized to take advantage of the additional cores
present in modern gear. But doing so requires a programmer's mind.
Math people that do this type of test rarely possess one.
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
Are you aware of any example code on Github? I'd be interested in
the implementation details of a decent realization of this technique.
Thank you,
Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
Stefan Monnier <[email protected]> writes:
Andy Valencia [2026-03-05 08:36:24] wrote:
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
Are you aware of any example code on Github?
What does this have to do with Github?
Andy is looking for examples. The most likely place to find them
this year, would be github (rather than sourceforge et alia).
MitchAlsup wrote:
Michael S <[email protected]> posted:
Besides, there are other reasons why modern big OoO will run rounds
around Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect. 2. Even if compilers were perfect, Cray
has far fewer physical
register than Big OoO, which would inevitably lead to far fewer memory
accesses running simultaneously.
CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
STs of the scatter type. So, 3 instructions would create 192 memory
references. And all 192 references could be 'satisfied' in 64-clocks.
I know of no GBOoO machine with 192 outstanding memory references, and
fewer still that can satisfy all 192 MRs in 64 clocks.
There are quite a few recent papers exploring having huge numbers of
Miss Status Holding Registers (MSHRs) in FPGAs using BRAMs,
allowing numbers like 2048 concurrent misses. e.g.:
Stop crying over your cache miss rate:
Handling efficiently thousands of outstanding misses in FPGAs, 2019
https://dl.acm.org/doi/pdf/10.1145/3289602.3293901
Also some papers exploring optimizing scatter-gathers. e.g.:
Piccolo: Large-scale graph processing with
fine-grained in-memory scatter-gather, 2025
https://arxiv.org/abs/2503.05116
Andy Valencia <[email protected]> posted:
MitchAlsup <[email protected]d> writes:
There are simple PRNGs that create very 'white' RNG sequences. However,
a generated RN is used to index a table of previously computed RNs,
and then swap the accessed one with the generated one. The table goes
a long way in 'whitening' the RNG.
Are you aware of any example code on Github? I'd be interested in
the implementation details of a decent realization of this technique.
Roughly:
# define shift (8)
# define mask ((1u << shift) - 1)      /* low bits select a table slot */
static unsigned tblindx = 0;
static unsigned table[ 1 << shift ];   /* seed with PRNG() output before use */
extern unsigned PRNG( void );
unsigned WhiteRNG( void )
{
    unsigned index, RNG;
    if( tblindx == 0 ) tblindx = PRNG();  /* refill the pool of indices */
    index = tblindx & mask;
    tblindx >>= shift;
    RNG = table[ index ];       /* hand out a previously computed RN */
    table[ index ] = PRNG();    /* and swap in a fresh one */
    return RNG;
}
will whiten any reasonable PRNG. There are a variety of ways to index
the table that make little difference in the outcome. 20 years ago I
knew the name for this, but I could not find it on my shelf.
Thank you,
Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
EricP wrote:
MitchAlsup wrote:
Michael S <[email protected]> posted:
Besides, there are other reasons why modern big OoO will run rounds
around Cray, either 1 or 2, in that sort of test. Most important are
1. Compilers are not perfect. 2. Even if compilers were perfect, Cray
has far fewer physical
register than Big OoO, which would inevitably lead to far fewer memory
accesses running simultaneously.
CRAY-Y/MP could run 2LDs and 1ST where the LDs were of gather-type and
STs of the scatter type. So, 3 instructions would create 192 memory
references. And all 192 references could be 'satisfied' in 64-clocks.
I know of no GBOoO machine with 192 outstanding memory references, and
fewer still that can satisfy all 192 MRs in 64 clocks.
There are quite a few recent papers exploring having huge numbers of
Miss Status Holding Registers (MSHR) in FPGA's using BRAM's,
allowing numbers like 2048 concurrent misses. e.g.:
Stop crying over your cache miss rate:
Handling efficiently thousands of outstanding misses in FPGAs, 2019
https://dl.acm.org/doi/pdf/10.1145/3289602.3293901
Also some papers exploring optimizing scatter-gathers. e.g.:
Piccolo: Large-scale graph processing with
fine-grained in-memory scatter-gather, 2025
https://arxiv.org/abs/2503.05116
Thinking about optimizing scatter-gathers...
The current design is optimized for moving 64B+ECC cache lines.
The DIMM has 9 DDR DRAM chips which are read at the same row
and columns in parallel, with 4 clocks to read a 64/72 ECC word,
and 8 times that = 32 clocks for a 64B-ECC line.
It moves to the cache (more clocks) and moves up through the cache
hierarchy to the core where we finally extract/insert 1 data item.
But for scatter-gather (SG) we typically only need one data item
of 4 or 8 bytes out of that 64B line so most of that was waste.
Furthermore the DRAM chips on a DIMM are all used in lockstep,
with no concurrency between them.
It could be possible to send an SG read or write packet containing up to
64 64-bit physical addresses + data for writes to the memory controller,
and have it perform the whole scatter gather optimally.
Internally each DRAM chip is composed of some number of banks,
each bank with some number of subarrays, and each subarray with its
own independent sense amps and latches for its rows and columns of bits.
It could be possible to open a separate row on each subarray in each bank.
It could be possible to read/write multiple subarrays at the same time
with each subarray IO routed to a separate pin.
The SG memory controller would be redesigned to read and write
multiple 72 bit data items concurrently by:
- instead of storing individual bits of 9-bit bytes (data+parity) spread
across separate DRAM chips on a DIMM, store those 9bytes as bits in
the same row of a single subarray of a single DRAM chip.
- change the DRAM row size to be a multiple of 9-bit bytes
- a 72-bit word would be split as two 36-bit half words across pairs of
subarrays both read at once, which using DDR would be 18 clocks.
(Compared to the 32 clocks for reading whole 64B cache lines.)
- multiple subarray pairs can be opened and read/written concurrently
- each DIMM has 8 DRAM chips, and each chip can have all subarray pairs
open at different rows, and can read/write multiple 72-bit words
to different subarray pairs at once.
If each DIMM has 8 DRAM chips, and each chip had 8 subarray IO pins,
this could perform a 64-way scatter-gather of 72-bit words in as little
as 18 clocks with data words in separate subarrays. Or a worst case where
all the physical addresses land in the same subarray, 64*18 clocks.
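The best-case and worst-case figures above follow from a simple draining model. A sketch, with the channel count and per-word clocks taken from the proposal; the only assumption added is that the packet finishes when its most heavily loaded channel does:

```python
# Back-of-envelope timing for the proposed scatter-gather (SG) memory
# controller. Clock counts come from the description above; the
# "requests per channel" distribution is the only added assumption.
CLOCKS_PER_WORD = 18   # one 72-bit word as two 36-bit halves over DDR
N_REQUESTS      = 64   # physical addresses per SG packet
N_CHIPS         = 8    # DRAM chips per DIMM
PINS_PER_CHIP   = 8    # independent subarray IO pins per chip

channels = N_CHIPS * PINS_PER_CHIP  # 64 independent word streams

def sg_clocks(max_requests_on_one_channel):
    # The packet drains when the most heavily loaded channel finishes.
    return max_requests_on_one_channel * CLOCKS_PER_WORD

best_case  = sg_clocks(1)           # all 64 words hit distinct subarrays
worst_case = sg_clocks(N_REQUESTS)  # all 64 words land in one subarray
```

A real controller would add row-activation and bank-conflict overheads that this model ignores, so the 18-clock best case is a lower bound.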
Michael S <[email protected]> writes:
On Wed, 4 Mar 2026 07:44:39 -0800
Stephen Fuld <[email protected]d> wrote:
On 3/3/2026 1:55 PM, Scott Lurndal wrote:
Stephen Fuld <[email protected]d> writes:
On 3/3/2026 11:01 AM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
On 3/1/2026 1:13 PM, MitchAlsup wrote:
Stephen Fuld <[email protected]d> posted:
snip
cpu designers minimize latency at a given BW, while
Long term store designers maximize BW at acceptable latency.
Completely different design points.
Perhaps that is true now, but it certainly didn't use to be. In
1979-1980 I wrote the microcode to add caching to my employer's
disk controller, making it the industry's first true cache disk
controller. This was almost all about reducing latency (from
tens of milliseconds on a non-cache controller to hundreds of
microseconds on a cache hit). There was a small improvement in
transfer rate, but the latency reduction dominated the
improvement.
There is an SSD that can perform 3,300,000 × 4096B random read
transfers per second over a PCIe 5.0 ×4 connector. That is 13.2GB/s
over the PCIe link, which is BW-limited to 15.x GB/s. Each "RAS"
has a 70µs access delay.
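The quoted link and transfer numbers can be sanity-checked. A quick sketch: PCIe 5.0 signals at 32 GT/s per lane with 128b/130b encoding, which reproduces the "15.x GB/s" link limit; the raw 4096-byte payload math comes out a bit above the quoted 13.2 GB/s, presumably because that figure accounts for protocol overhead:

```python
# Sanity check of the SSD figures quoted above.
GTS_PER_LANE = 32e9        # PCIe 5.0: 32 GT/s per lane
LANES        = 4
ENCODING     = 128 / 130   # 128b/130b line code

link_bw = GTS_PER_LANE * LANES / 8 * ENCODING  # bytes/second, ~15.75 GB/s

iops    = 3_300_000
payload = iops * 4096                          # bytes/second of user data
```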
Wow! But I think you will agree that design is unlikely to be
used for "mass market" hundreds-of-terabytes systems used for
commercial database systems, etc.
I think you'll find that the commercial databases are dominated
by high-end NVMe (PCI based SSDs) for working storage, with
spinning rust as archive storage.
For example, a 61TB NVMe PCIe gen5 SSD for USD 7,700.00:
https://techatlantix.com/mzwmo61thclf-00aw7.html
I freely admit that I am "out of the loop" for modern systems.
However, this system costs about $125/TB. A quick check of Amazon
shows typical hard disk prices at about $25/TB.
That's not a fare comparison.
Indeed, it's not even a fair comparison :-)
The $25/TB you see on Amazon is likely a 5400 rpm 4TB disk.
So, at 61 TB you only have 15 spindles. It means that even in ideal
conditions of accesses distributed evenly to all disks your sequential
read bandwidth is no better than ~1,200 MB/s.
For comparison, sequential read speed of BM1743 SSD is 7,500 MB/s.
In order to get comparable bandwidth with HDs you will need
95 spindles. Maybe a little less with 7200 rpm. I don't know how many
spindles you would need with the 15,000 rpm HDs that enterprises used
for databases 20+ years ago. It seems they had better latency than
7200 rpm, but about the same bandwidth. Anyway, I'm not even sure that
anybody still makes them.
95 disks alone almost certainly cost more than 8KUSD. And on top of
that you will need a dozen expensive RAID controllers.
So, essentially, when you paid 8KUSD for this SSD, you paid it for
bandwidth alone. 100x improvement in latency that you got over HD-based
solution is a free bonus. Density and power too.
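The spindle arithmetic above checks out. A sketch, assuming 4 TB drives at $25/TB (~$100 each) and the ~1,200 MB/s aggregate quoted for 15 spindles (i.e. ~80 MB/s sustained per 5400 rpm spindle):

```python
import math

# Arithmetic behind the spindle comparison above. Drive price and
# per-spindle rate are assumptions derived from the figures quoted.
SSD_TB, SSD_USD = 61, 7700
HDD_TB, HDD_USD = 4, 100
SSD_SEQ_MBPS    = 7500
HDD_SEQ_MBPS    = 1200 / 15  # ~80 MB/s sustained per spindle

spindles_for_capacity  = math.ceil(SSD_TB / HDD_TB)            # ~16
spindles_for_bandwidth = math.ceil(SSD_SEQ_MBPS / HDD_SEQ_MBPS)  # ~94
hdd_only_cost          = spindles_for_bandwidth * HDD_USD
```

So ~16 spindles match the capacity, but ~94 (call it 95 in round numbers) are needed to match the sequential bandwidth, and the drives alone already cost more than the $7,700 SSD before adding controllers, enclosures and power.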
Density and power are the most important criteria in modern
datacenters. Reliability is also a consideration. While
the backblaze drive reports show moderately reasonable results
for most spinning rust, the NVME SSD is far more reliable and
consumes far less power and rack space than the equivalent
hard disk would.
Another advantage of NVMe cards is the availability of
PCI SR-IOV, which allows the NVMe card to be partitioned
and made available to multiple independent guests without
sacrificing security. The downside of host-based NVMe is
the inability to share bandwidth with multiple hosts. That
downside is eliminated with external NVMe-based RAID
subsystems connected via 400GbE or FC.
https://www.truenas.com/r-series/r60/
7PB at 60GB/sec.
Idle:
Wonders if some of the approaches used in SSDs could be used to make higher-density DRAM.
Say, rather than 1-bit per DRAM cell, they can hold multiple bits.
Granted, this might require the RAM to implement its own refresh process
to retain stability; and possibly some changes to the RAM protocols
(events like selecting rows may need to be made to be able to handle variable latency).
Say, for example, rather than changing a row and waiting a fixed RAS
latency, one changes a row and waits for the RAM to signal back
that the active row has been changed. Likely CAS could remain fixed
latency though.
Say, for example, if RAS moves a row into an internal SRAM cache, with
CAS accessing this SRAM. Closing the row or changing rows then writing
these back to the internal DRAM cells, with a per-row autonomous refresh built into the RAM.
Like with SSDs, there could probably be invisible ECC in the RAM to deal with cases where the DRAM cells deteriorate between refresh cycles.
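As an aside on the "invisible ECC" idea: the 64B+ECC lines and 72-bit words elsewhere in this thread already carry SECDED ECC. A quick sketch of where the 8 extra bits per 64 data bits come from, assuming the conventional Hamming-plus-overall-parity construction:

```python
# SECDED (single-error-correct, double-error-detect) check-bit count.
# Single-error correction needs r check bits satisfying the Hamming
# bound 2**r >= data_bits + r + 1, plus one extra overall-parity bit
# for double-error detection.
def secded_check_bits(data_bits):
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # +1 overall parity bit for double-error detection

# 64 data bits -> 8 check bits -> the familiar 72-bit DRAM word.
word_bits = 64 + secded_check_bits(64)
```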
...
Cray-2 memory was big enough, but was it fast enough latency wise?
All info I see about Cray-2 memory praises its great capacity and
bandwidth, but tells nothing about latency.
It seems the latency was huge and the whole system was useful only due
to a mechanism that today we would call hardware prefetch. Or software
prefetch? I am not sure.
But it is possible that I misunderstood.
On Wed, 04 Mar 2026 23:46:46 GMT
MitchAlsup <[email protected]d> wrote:
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and the accessed one is then swapped with the generated
one. The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
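The table-swap scheme just described can be sketched in a few lines. This is in the style of the Bays-Durham/Knuth 3.2.2 shuffles; the raw generator here (an LCG with Knuth's MMIX constants) and the table size are purely illustrative choices, not anything from the post:

```python
# Sketch of the table-whitening scheme described above: a raw
# generator feeds a small table, and each fresh value is swapped
# with the table entry it indexes.
MASK64 = (1 << 64) - 1
TABLE_SIZE = 64

class WhitenedPRNG:
    def __init__(self, seed):
        self.state = (seed & MASK64) or 1
        # Pre-fill the table with raw generator output.
        self.table = [self._raw() for _ in range(TABLE_SIZE)]

    def _raw(self):
        # 64-bit LCG step (illustrative raw generator).
        self.state = (self.state * 6364136223846793005
                      + 1442695040888963407) & MASK64
        return self.state

    def next(self):
        fresh = self._raw()
        i = fresh % TABLE_SIZE                     # fresh RN indexes the table
        out, self.table[i] = self.table[i], fresh  # swap accessed <-> generated
        return out

rng = WhitenedPRNG(12345)
samples = [rng.next() for _ in range(1000)]
```

Note that this is exactly the memory-reference cost being discussed: every call does one table read and one table write, though a 64-entry table trivially stays cache-resident.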
I know how to build extremely good (but not crypto quality) and
reasonably fast PRNG completely in CPU registers (hint: all modern cores
can do a round of AES as a single reg-reg instruction). But such PRNG
can not have standard API.
Michael S <[email protected]> schrieb:
Cray-2 memory was big enough, but was it fast enough latency wise?
All info I see about Cray-2 memory praises its great capacity and bandwidth, but tells nothing about latency.
It seems, the latency was huge and the whole system was useful only
due to mechanism that today we will call hardware prefetch. Or
software prefetch? I am not sure.
But it is possible that I misunderstood.
The Cray architecture had vector registers, which was an advantage
over the memory-to-memory architectures like the Cyber 205.
And it could run calculations in parallel with memory operations.
I guess you can call loading a vector register from memory a
"software prefetch" if you want, but it would be a bit of a
stretch of the terminology.
Michael S <[email protected]> schrieb:
On Wed, 04 Mar 2026 23:46:46 GMT
MitchAlsup <[email protected]d> wrote:
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated
one. The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
I know how to build extremely good (but not crypto quality) and
reasonably fast PRNG completely in CPU registers (hint: all modern
cores can do a round of AES as a single reg-reg instruction). But
such PRNG can not have standard API.
You can use "darn" on POWER, of course :-)
But why no standard API?
On Sat, 7 Mar 2026 13:29:00 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Michael S <[email protected]> schrieb:
On Wed, 04 Mar 2026 23:46:46 GMT
MitchAlsup <[email protected]d> wrote:
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated
one. The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
I know how to build extremely good (but not crypto quality) and reasonably fast PRNG completely in CPU registers (hint: all modern
cores can do a round of AES as a single reg-reg instruction). But
such PRNG can not have standard API.
You can use "darn" on POWER, of course :-)
But why no standard API?
Standard PRNG APIs assume that PRNG state is stored in some type of
object. In older, more primitive APIs, like the C RTL's rand/srand,
there is one global object. In more modern APIs, the user can have as
many objects as he wishes, but they are still objects, somewhere in
memory. One API call delivers one random number and updates an object
(or the object). So, there is inevitably at least one memory read and
one memory write going on per API call.
In the scenario described by Kent Dickey, accesses to the PRNG object
will have very good temporal locality, which means that on CPUs with
cache they will have a very good hit rate. Which means that CPUs with
cache will do that part of the test much faster than cacheless Cray
processors.
Michael S <[email protected]> schrieb:
On Wed, 04 Mar 2026 23:46:46 GMT
MitchAlsup <[email protected]d> wrote:
There are simple PRNGs that create very 'white' RNG sequences.
However, a generated RN is used to index a table of previously
computed RNs, and then swap the accessed one with the generated
one. The table goes a long way in 'whitening' the RNG.
So, good PRNGs are not memory reference free on the data side. But,
on the other hand, the table does not have to be "that big".
I know how to build extremely good (but not crypto quality) and
reasonably fast PRNG completely in CPU registers (hint: all modern
cores can do a round of AES as a single reg-reg instruction). But
such PRNG can not have standard API.
You can use "darn" on POWER, of course :-)
But why no standard API?
Stefan Monnier <[email protected]> writes:
...What does this have to do with Github?
It's the industry standard way to share code?
Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
It comes with 2 consequences::
a) RAS time would at least double (30ns->60ns) but likely go up 4×
b) refresh rate would go up by factor of 4× minimum.
Say, rather than 1-bit per DRAM cell, they can hold multiple bits.
Granted, this might require the RAM to implement its own refresh process
to retain stability; and possibly some changes to the RAM protocols
(events like selecting rows may need to be made to be able to handle
variable latency).
Say, for example, rather than changing a row and waiting a fixed RAS
latency, it is changing a row and waiting for the RAM to signal back
that the active row has been changed. Likely CAS could remain fixed
latency though.
Say, for example, if RAS moves a row into an internal SRAM cache, with
CAS accessing this SRAM. Closing the row or changing rows then writing
these back to the internal DRAM cells, with a per-row autonomous refresh
built into the RAM.
Like with SSDs, there could probably be invisible ECC in the RAM to deal
with cases where the DRAM cells deteriorate between refresh cycles.
...
On 3/4/2026 11:17 AM, Michael S wrote:
On Wed, 4 Mar 2026 19:38:08 +0100
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Wed, 4 Mar 2026 19:03:54 +0100
Terje Mathisen <[email protected]> wrote:
Stefan Monnier wrote:
Let me try again. Suppose you had a (totally silly) program
with a 2 GB array, and you used a random number to generate an
address within it, then added the value at that addressed byte
to an accumulator. Repeat say 10,000 times. I would call
this program latency bound, but I suspect Anton would call it
bandwidth bound. If that is true, then that explains the
original differences Anton and I had.
I think in theory, this is not latency bound: assuming enough
CPU and memory parallelism in the implementation, it can be
arbitrarily fast. But in practice it will probably be
significantly slower than if you were to do a sequential
traversal.
10K selected from 2G means average distance of 200K, so you get
effectively very close to zero cache hits, and even TLB misses
might be very significant unless you've set up huge pages.
Relatively horrible.
At a human time scale it would still be very fast.
Assuming TLB+$L3+$L2+$L1 misses on every access the actual
runtime will be horrible!
Indeed, in practice you may sometimes see the performance be
correlated with your memory latency, but if so it's only because
your hardware doesn't offer enough parallelism (e.g. not enough
memory banks).
AFAIK, when people say "latency-bound" they usually mean that
adding parallelism and/or bandwidth to your memory hierarchy
won't help speed it up (typically because of pointer-chasing).
This is important, because it's a *lot* more difficult to reduce
memory latency than it is to add bandwidth or parallelism.
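The random-accumulator program from the start of this exchange is worth pinning down, because its loads are independent rather than pointer-chasing. A scaled-down sketch (a 1 MiB array standing in for the 2 GB one, with a fixed seed; sizes are illustrative only):

```python
import random

# Scaled-down model of the random-access accumulator discussed above.
# The real experiment uses a 2 GB array so essentially every access
# misses the TLB and all cache levels.
ARRAY_SIZE = 1 << 20   # 1 MiB stand-in for the 2 GB array
N_ACCESSES = 10_000

data = bytes(i & 0xFF for i in range(ARRAY_SIZE))

rng = random.Random(42)   # fixed seed for reproducibility
acc = 0
for _ in range(N_ACCESSES):
    # Each address depends only on the PRNG, never on the value just
    # loaded - so the loads are independent and can overlap in an OoO
    # core, unlike true pointer-chasing.
    acc += data[rng.randrange(ARRAY_SIZE)]
```

That independence is exactly why one can argue the loop is bandwidth/parallelism-bound in principle, even though on real hardware its throughput often tracks memory latency.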
When the working set does not allow any cache re-use, then a
classic Cray could perform much better than a modern OoO cpu.
Terje
When working set does not allow any cache re-use then it does not
fit in classic Cray's main memory.
The 1985 Cray-2 allowed 2GB, so theoretically possible with the
OS+program in the ~147 MB gap between 2GiB and 2E9 bytes.
Easily done on a later Cray-Y-MP.
I had Cray-1 in mind.
Cray-2 memory was big enough, but was it fast enough latency wise?
All info I see about Cray-2 memory praises its great capacity and bandwidth, but tells nothing about latency.
It seems the latency was huge and the whole system was useful only
due to a mechanism that today we would call hardware prefetch. Or
software prefetch? I am not sure.
But it is possible that I misunderstood.
Well, Cray had that vector thing going for it. :-) And, as Mitch
has repeatedly pointed out, the memory bandwidth to support it. And "reasonable" memory latency for the non-vector operations.
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
Roughly: This is a modification of Knuth's Algorithm M in section
3.2.2 of The Art of Computer Programming, Vol. 2, Seminumerical
Algorithms, where there is a detailed discussion of it.
It does help to make simple LFSR PRNGs much better, but xorshift128+
(or others like it) is better for 64-bit CPUs (see
https://en.wikipedia.org/wiki/Xorshift).
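For reference, xorshift128+ is exactly the kind of register-resident PRNG being discussed: two 64-bit state words and a handful of shifts and xors. A sketch using Vigna's original parameter set (23/17/26); the seed values are arbitrary, and Python needs explicit masking to emulate 64-bit wraparound:

```python
# Minimal xorshift128+ sketch - two 64-bit state words, no memory
# traffic beyond the state itself.
MASK64 = (1 << 64) - 1

class Xorshift128Plus:
    def __init__(self, s0, s1):
        assert (s0 | s1) != 0, "state must not be all zero"
        self.s0, self.s1 = s0 & MASK64, s1 & MASK64

    def next(self):
        s1, s0 = self.s0, self.s1
        self.s0 = s0
        s1 ^= (s1 << 23) & MASK64           # a = 23
        self.s1 = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26)  # b = 17, c = 26
        return (self.s1 + s0) & MASK64

stream = Xorshift128Plus(0x0123456789ABCDEF, 0xFEDCBA9876543210)
first_hundred = [stream.next() for _ in range(100)]
```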
I guess you can call loading a vector register from memory a
"software prefetch" if you want, but it would be a bit of a
stretch of the terminology.
On Sat, 07 Mar 2026 01:49:31 GMT
MitchAlsup <[email protected]d> wrote:
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
As far as putting 2-bits in a single DRAM cell, yes you could do it.
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
How exactly?
With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
Then you connect your storage capacitor to the bit line and it pulls it
slightly up or pushes it slightly down. Then an open-loop sense amplifier
detects this slight change in the voltage of the bit line.
I don't see how any of that will work to distinguish between 4 levels of
voltage instead of 2 levels. Much less so, to distinguish between 16
levels, as in QLC flash.
In flash memory it works very differently.
Not that I understand how it works completely, but I understand enough
to know that a flash cell's charge is not discharged into the higher
capacitance of a bit line on every read operation.
Thomas Koenig <[email protected]> writes:
I guess you can call loading a vector register from memory a
"software prefetch" if you want, but it would be a bit of a
stretch of the terminology.
That's like saying that Humpty Dumpty stretched the terminology of
"glory".
Software prefetch instructions are architectural noops.
Microarchitecturally, they may load the accessed memory into a cache. Likewise, hardware prefetchers load some memory into a cache.
Loading a vector register from memory has an architectural effect.
And given that the classic Crays don't have caches, prefetching
instructions would always be noops.
- anton
On 07/03/2026 23:28, Michael S wrote:
On Sat, 07 Mar 2026 01:49:31 GMT
MitchAlsup <[email protected]d> wrote:
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
How exactly?
With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
Then you connect your storage capacitor to the bit line and it pulls it
slightly up or pushes it slightly down. Then an open-loop sense
amplifier detects this slight change in the voltage of the bit line.
I don't see how any of that will work to distinguish between 4
levels of voltage instead of 2 levels. Much less so, to distinguish
between 16 levels, as in QLC flash.
In flash memory it works very differently.
Not that I understand how it works completely, but I understand
enough to know that flash cell's charge is not discharged into
higher capacitance of bit line on every read operation.
In theory you could have your DRAM cell hold different voltage
levels. Your description of how DRAM works is mostly fine (AFAIUI - I
am not a chip designer), except that I think the read sense
amplifiers must have much lower input capacitance than the cells'
storage capacitors.
It would certainly be possible to put the write line at four
different levels rather than just high or low - just like you can do
with a flash cell. The trouble is that where the flash cell has
extremely low leakage of its stored charge, DRAM cells use a small
capacitor and have lots of leakage. As soon as you stop writing, the capacitor charge leaks back and forth to ground, to positive supply,
to control lines, to neighbour cells, and so on. Reducing the
voltage change from that leakage means using a bigger capacitor,
which is slower to write.
All this means that some time after you have written to a DRAM cell,
the voltage on the cell capacitor is different from what you wrote.
You have to read it, and re-write it, before the voltage changes
enough to be misinterpreted. And the very act of reading changes the
cell's charge too, depending on leakage to the nearby read lines, and
the state of the input capacitance on the read lines. Some of all
these leakages, changes and influences is relatively predictable from
the layout of the chips - other parts vary significantly depending on temperature, the values in neighbouring cells, and the read/write
patterns. (Remember "Rowhammer" ?)
Storing 2 bits per cells makes all this hugely worse. And that means
you need bigger storage capacitors to reduce the effect - making the
cells bigger, slower and requiring more power. Your read and write circuitry becomes significantly more complex (and big, slow, and
power hungry). And you have to have far shorter refresh cycles (you
guessed it - it makes things bigger, slower and more power-hungry).
As I say, I am not a chip designer. But I think that while you could
make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power. Clearly it would not be a good idea. But you /could/ do
it.
Cray-2 had no cache, same as 1, X-MP and Y-MP. Nevertheless, unlike
1, and according to my understanding, unlike X-MP and Y-MP, memory
hierarchy of Cray-2 was not flat.
Here is a paragraph from Wikipedia:
"To avoid this problem the new design banked memory and two sets of
registers (the B- and T-registers) were replaced with a 16 KWord block
of the very fastest memory possible called a Local Memory, not a cache,
attaching the four background processors to it with separate high-speed
pipes. This Local Memory was fed data by a dedicated foreground
processor which was in turn attached to the main memory through a
Gbit/s channel per CPU; X-MPs by contrast had three, for two
simultaneous loads and a store and Y-MP/C-90s had five channels to
avoid the von Neumann bottleneck. It was the foreground processor's
task to "run" the computer, handling storage and making efficient use
of the multiple channels into main memory.
On Sun, 8 Mar 2026 12:13:24 +0100
David Brown <[email protected]> wrote:
On 07/03/2026 23:28, Michael S wrote:
On Sat, 07 Mar 2026 01:49:31 GMT
MitchAlsup <[email protected]d> wrote:
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
How exactly?
With 1 bit per cell, you pre-charge a bit line to a mid-level voltage.
Then you connect your storage capacitor to the bit line and it pulls it
slightly up or pushes it slightly down. Then an open-loop sense
amplifier detects this slight change in the voltage of the bit line.
I don't see how any of that will work to distinguish between 4
levels of voltage instead of 2 levels. Much less so, to distinguish
between 16 levels, as in QLC flash.
In flash memory it works very differently.
Not that I understand how it works completely, but I understand
enough to know that flash cell's charge is not discharged into
higher capacitance of bit line on every read operation.
In theory you could have your DRAM cell hold different voltage
levels. Your description of how DRAM works is mostly fine (AFAIUI - I
am not a chip designer), except that I think the read sense
amplifiers must have much lower input capacitance than the cells'
storage capacitors.
It's not the amplifier that has high capacitance, it's the bit line.
And it is inevitable*.
The rest of your post does not make a lot of sense, because you don't
take it into account. Most of the things you wrote there are correct,
but they make no practical difference, because of this high
capacitance.
* - inevitable for as long as you want to have 4-8K rows per bank.
But there are plenty of good reasons why you would want to have many
rows per bank, if not 4K then at the very least 1K. I can list some
reasons, but (1) it would be off topic, (2) I am not a specialist, so
there is a danger of me being wrong in details.
It would certainly be possible to put the write line at four
different levels rather than just high or low - just like you can do
with a flash cell. The trouble is that where the flash cell has
extremely low leakage of its stored charge, DRAM cells use a small
capacitor and have lots of leakage. As soon as you stop writing, the
capacitor charge leaks back and forth to ground, to positive supply,
to control lines, to neighbour cells, and so on. Reducing the
voltage change from that leakage means using a bigger capacitor,
which is slower to write.
All this means that some time after you have written to a DRAM cell,
the voltage on the cell capacitor is different from what you wrote.
You have to read it, and re-write it, before the voltage changes
enough to be misinterpreted. And the very act of reading changes the
cell's charge too, depending on leakage to the nearby read lines, and
the state of the input capacitance on the read lines. Some of all
these leakages, changes and influences is relatively predictable from
the layout of the chips - other parts vary significantly depending on
temperature, the values in neighbouring cells, and the read/write
patterns. (Remember "Rowhammer" ?)
Storing 2 bits per cells makes all this hugely worse. And that means
you need bigger storage capacitors to reduce the effect - making the
cells bigger, slower and requiring more power. Your read and write
circuitry becomes significantly more complex (and big, slow, and
power hungry). And you have to have far shorter refresh cycles (you
guessed it - it makes things bigger, slower and more power-hungry).
As I say, I am not a chip designer. But I think that while you could
make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power. Clearly it would not be a good idea. But you /could/ do
it.
As I say, I am not a chip designer. But I think that while you could
make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power.
On 08/03/2026 12:37, Michael S wrote:
On Sun, 8 Mar 2026 12:13:24 +0100
David Brown <[email protected]> wrote:
On 07/03/2026 23:28, Michael S wrote:
On Sat, 07 Mar 2026 01:49:31 GMT
MitchAlsup <[email protected]d> wrote:
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
How exactly?
With 1 bit per cell, you pre-charge a bit line to a mid-level
voltage. Then you connect your storage capacitor to the bit line and
it pulls it slightly up or pushes it slightly down. Then an
open-loop sense amplifier detects this slight change in the voltage
of the bit line.
I don't see how any of that will work to distinguish between 4
levels of voltage instead of 2 levels. Much less so, to
distinguish between 16 levels, as in QLC flash.
In flash memory it works very differently.
Not that I understand how it works completely, but I understand
enough to know that flash cell's charge is not discharged into
higher capacitance of bit line on every read operation.
In theory you could have your DRAM cell hold different voltage
levels. Your description of how DRAM works is mostly fine (AFAIUI
- I am not a chip designer), except that I think the read sense
amplifiers must have much lower input capacitance than the cells'
storage capacitors.
It's not the amplifier that has high capacitance, it's the bit line.
And it is inevitable*.
The rest of your post does not make a lot of sense, because you don't
take it into account. Most of the things you wrote there are correct,
but they make no practical difference, because of this high
capacitance.
* - inevitable for as long as you want to have 4-8K rows per bank.
But there are plenty of good reasons why you would want to have many
rows per bank, if not 4K then at the very least 1K. I can list some
reasons, but (1) it would be off topic, (2) I am not a specialist,
so there is a danger of me being wrong in details.
Ultimately, it doesn't make a big difference if the capacitance here
is in the sense amplifier, or the lines feeding it - though I can
fully appreciate that the majority of the capacitance is in the lines
here.
If the line capacitance here is so much higher than the cell
capacitance (and I'll happily bow to your knowledge here), then it
means the voltage seen by the sense amplifier will be a fraction of
the charge voltage on the cell capacitor. Fair enough - it just
means the voltage threshold between a 1 and a 0 is that much lower.
But the rest of the argument remains basically the same. If you want
to hold more than one bit in the cell, you need to distinguish
between four voltage levels instead of 2 voltage levels. The
relative differences in voltages are the same, the influences on cell capacitor leakage are the same. But as the absolute voltage levels
into the sense amplifier are now much smaller, other noise sources
(like thermal noise) are relatively speaking more important, which
reduces your error margins even more. So again we are left with the multi-level DRAM cell being theoretically possible, but now the
practicality is even worse - and you probably also need to reduce the
number of bits per sense amplifier line to reduce the capacitance.
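The charge-sharing argument above can be put in numbers. A sketch of the readout model being discussed; the capacitance and supply values below are purely illustrative assumptions, not figures from any real part:

```python
# Charge-sharing readout: dumping the cell capacitor C_cell onto a
# precharged bit line C_bl leaves the sense amp only
#   delta_v = (V_cell - V_pre) * C_cell / (C_cell + C_bl).
def bitline_swing(v_cell, v_pre, c_cell, c_bl):
    return (v_cell - v_pre) * c_cell / (c_cell + c_bl)

C_CELL = 20e-15    # ~20 fF storage cell (assumed)
C_BL   = 200e-15   # bit line an order of magnitude larger (assumed)
VDD    = 1.1
V_PRE  = VDD / 2

# 1 bit/cell: a stored '1' sits VDD/2 above the precharge level,
# giving ~50 mV of swing with these values.
swing_1bit = bitline_swing(VDD, V_PRE, C_CELL, C_BL)

# 2 bits/cell: 4 levels across 0..VDD put adjacent levels only VDD/3
# apart, so the worst-case distinguishable swing shrinks by 3x -
# before any noise, leakage or process variation is accounted for.
swing_2bit = bitline_swing(V_PRE + VDD / 6, V_PRE, C_CELL, C_BL)
```

With the already-attenuated signal cut by another factor of three, the shrinking error margins the paragraph above describes fall out directly.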
As far as I can very roughly estimate, I believe Mitch's comment that
you could put 2 bits in a single DRAM cell, but I think the
disadvantages would be more dramatic than he suggested.
It does not
surprise me that nobody makes multi-level DRAM cells (AFAIK).
As I say, I am not a chip designer. But I think that while you could
make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power.
According to https://en.wikipedia.org/wiki/Multi-level_cell:
In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
chip with quad-level cells, holding a capacity of 4 Gbit.
So apparently it's been done. I can't find any reference for that
claim, tho. Has anyone heard of it?
Comparing the situation of DRAM vs flash storage, I notice that MLC
cells in flash storage sometimes read the data multiple times to get
a more precise reading of the voltage level (IIUC). DRAM doesn't really
have that option.
=== Stefan
IA-64 certainly had significantly larger code sizes than others,
but I think that they expected it and found it acceptable.
And a software-pipelined loop on IA-64 is probably smaller than an auto-vectorized loop on AMD64+AVX.
Actually, other architectures also added prefetching instructions
for dealing with that problem. All I have read about that was that
there were many disappointments when using these instructions.
I don't know if there were any successes, and how frequent they
were compared to the disappointments.
So I don't see that IA-64 was any different from other architectures
in that respect.
OoO helps in several ways: it will do some work in the shadow of the
load (although the utilization will still be abysmal even with
present-day schedulers and ROBs [1]); but more importantly, it can
dispatch additional loads that may also miss the cache, resulting in
more memory-level parallelism.
They wanted to do it (and did it) in the compiler; the corresponding architectural feature is IIRC the advanced load.
Having so many registers may have made it harder than otherwise, but
SPARC also used many registers.
The issue is that speculative execution and OoO makes all the
EPIC features of IA-64 unnecessary, so if they cannot do
a fast in-order implementation of IA-64 (and they could not), they
should just give up and switch to an architecture without these
features, such as AMD64. And Intel did, after a few years of
denial.
In a world where we see convergence on fewer and fewer architecture
styles and on fewer and fewer architectures, you only see the
investment necessary for high-performance implementations of a new architecture if there is a very good reason not to use one of the
established architectures (for ARM T32 and ARM A64 the smartphone
market was that reason). It may be that politics will provide that
reason for another architecture, but even then it's hard. But
RISC-V seems to have the most mindshare among the alternatives,
so if any architecture will catch up, it looks like the best bet.
On Sun, 8 Mar 2026 15:10:52 +0100
David Brown <[email protected]> wrote:
On 08/03/2026 12:37, Michael S wrote:
On Sun, 8 Mar 2026 12:13:24 +0100
David Brown <[email protected]> wrote:
On 07/03/2026 23:28, Michael S wrote:
On Sat, 07 Mar 2026 01:49:31 GMT
MitchAlsup <[email protected]d> wrote:
BGB <[email protected]> posted:
----see how easy it is to snip useless material-------
Idle:
Wonders if some of the approaches used in SSDs could be used to
make higher-density DRAM.
As far as putting 2-bits in a single DRAM cell, yes you could do it.
How exactly?
With 1 bit per cell, you pre-charge a bit line to mid-level
voltage. Then you connect your storage capacitor to the bit line and
it pulls it slightly up or pushes it slightly down. Then an
open-loop sense amplifier detects this slight change in the voltage
of the bit line.
I don't see how any of that will work to distinguish between 4
levels of voltage instead of 2 levels. Much less so, to
distinguish between 16 levels, as in QLC flash.
In flash memory it works very differently.
Not that I understand how it works completely, but I understand
enough to know that a flash cell's charge is not discharged into
the higher capacitance of the bit line on every read operation.
In theory you could have your DRAM cell hold different voltage
levels. Your description of how DRAM works is mostly fine (AFAIUI
- I am not a chip designer), except that I think the read sense
amplifiers must have much lower input capacitance than the cells'
storage capacitors.
It's not the amplifier that has high capacitance; it's the bit line.
And that is inevitable*.
The rest of your post does not make a lot of sense, because
you don't take it into account. Most of the things you wrote there
are correct, but they make no practical difference, because of this
high capacitance.
* - inevitable for as long as you want to have 4-8K rows per bank.
But there are plenty of good reasons why you do want to have
many rows per bank, if not 4K then at the very least 1K. I could list
some reasons, but (1) it would be off topic, and (2) I am not a
specialist, so there is a danger of me being wrong in the details.
Ultimately, it doesn't make a big difference if the capacitance here
is in the sense amplifier, or the lines feeding it - though I can
fully appreciate that the majority of the capacitance is in the lines
here.
If the line capacitance here is so much higher than the cell
capacitance (and I'll happily bow to your knowledge here), then it
means the voltage seen by the sense amplifier will be a fraction of
the charge voltage on the cell capacitor. Fair enough - it just
means the voltage threshold between a 1 and a 0 is that much lower.
No, it does not mean that.
In the 1-bit scenario your sense amplifier works in open loop - it
amplifies the difference between the precharge voltage and the observed
voltage on the bit line as strongly as it can. All it cares about is the
sign of the difference. It does not care about the exact ratio between
the capacitance of the bit line and the capacitance of the storage
capacitor. It also does not care if the cell was freshly refreshed or is
near the end of its refresh interval. It does not care about its own
gain, as long as the gain is high enough.
All that makes distinguishing between 2 levels not just a little
simpler, but a whole lot simpler than distinguishing between multiple
levels, even when "multiple" is just 4 or even 3.
Once again, I am not a specialist, but I would imagine that the latter
would require a totally different level of precision, both in the value
of the storage capacitor and in the capacitance of the bit line.
The amplifier itself would have to be much more complicated, too. It
would likely need two stages - first a sample-and-hold and only then a
set of comparators. All that will likely increase the power of a row
access by a much bigger factor than 2x.
But I don't think that the whole idea could proceed far enough to start
to care about power. The above-mentioned requirements of geometric
precision will kill it much earlier.
But the rest of the argument remains basically the same. If you want
to hold more than one bit in the cell, you need to distinguish
between four voltage levels instead of 2 voltage levels. The
relative differences in voltages are the same, the influences on cell
capacitor leakage are the same. But as the absolute voltage levels
into the sense amplifier are now much smaller, other noise sources
(like thermal noise) are relatively speaking more important, which
reduces your error margins even more. So again we are left with the
multi-level DRAM cell being theoretically possible, but now the
practicality is even worse - and you probably also need to reduce the
number of bits per sense amplifier line to reduce the capacitance.
As far as I can very roughly estimate, I believe Mitch's comment that
you could put 2 bits in a single DRAM cell, but I think the
disadvantages would be more dramatic than he suggested.
That's my point.
At the end, not only would you have worse power and worse speed, but
worse density as well.
It does not
surprise me that nobody makes multi-level DRAM cells (AFAIK).
On the other hand, it is possible that the second idea that makes NAND
so dense could apply to DRAM with a better chance of achieving something
positive. I mean, 3D QLC flash is so dense not only due to QLC, but
also due to 3D.
Stefan Monnier wrote:
As I say, I am not a chip designer. But I think that while you could
make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power.
According to https://en.wikipedia.org/wiki/Multi-level_cell:
In 1997, NEC demonstrated a dynamic random-access memory (DRAM)
chip with quad-level cells, holding a capacity of 4 Gbit.
So apparently it's been done. I can't find any reference for that
claim, tho. Has anyone heard of it?
Comparing the situation of DRAM vs flash storage, I notice that MLC
cells in flash storage sometimes read the data multiple times to get
a more precise reading of the voltage level (IIUC). DRAM doesn't really
have that option.
=== Stefan
A quick search for "multi-bit" "dram" finds a recent paper:
IGZO 2T0C DRAM With VTH Compensation Technique
for Multi-Bit Applications, 2025 https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf
which demonstrates 3 bits per cell but in just 25 cells.
"we proposed and experimentally demonstrated the novel dual-gate (DG) indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C) dynamic random-access memory (DRAM) for array-level multi-bit storage.
...
the optimized transistors... enable long retention time (>1500 s)
and ultra-fast writing speed (< 10 ns).
...
non-overlap 3-bit storage operation among 25 cells is achieved
...
Recent research efforts have mostly focused on developing capacitor-less two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural innovation primarily aims to save the large space occupied by storage capacitor in conventional one-transistor-one-capacitor (1T1C) design
...
Compared with 1T1C DRAM, the read operation of 2T0C DRAM is
non-destructive, which enables multi-bit storage in bit-cell"
On 08/03/2026 17:30, Michael S wrote:
On Sun, 8 Mar 2026 15:10:52 +0100
On the other hand, it is possible that the second idea that makes
NAND so dense could apply to DRAM with better chance of achieving
something positive. I mean, 3D QLC flash is so dense not only due
to QLC, but also due to 3D.
Scaling in the third direction would surely be good, yes. The
challenge, I would think, would be heat dissipation. Flash does not
need power to hold its data, so for the same speed of reading and
writing you have basically the same power requirements regardless of
the number of layers. With DRAM, power and heat would scale with the
layers due to refreshes. You already see heat sinks for fast DRAM,
so for a multi-layer device you'd need to put a lot more effort into
cooling.
Michael S <[email protected]> schrieb:
[snip]
All that makes distinguishing between 2 levels not just a little
simpler, but a whole lot simpler than distinguishing between multiple
levels, even when "multiple" is just 4 or even 3.
I want a balanced ternary computer, and I want it NOW!
Tonight’s tradeoff was having the memory page size determined by the
root pointer. A few bits (5) in a root pointer could be used to set the
page size. All references through that root pointer would then use the specified page size. When the root pointer changes, the page size goes
along with it.
I think not flushing the TLB could be got away with, with ASID matching
on the entries. For a given ASID the page size would be consistent with
the root pointer.
Alternately the TLB entry could be tagged with the root pointer register number, so if a different root pointer register is used the entry would
not match.
I have been studying the 68851 MMU. Quite complex compared to some other MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.
In article <[email protected]>, [email protected] (Anton Ertl) wrote:
---------------------
Actually, other architectures also added prefetching instructions
for dealing with that problem. All I have read about that was that
there were many disappointments when using these instructions.
I don't know if there were any successes, and how frequent they
were compared to the disappointments.
I have never encountered any successes, and given how keen Intel were on their x86 version of this, and my employers' relationship with them at
the time, I would expect to have heard about them. My own experience was disappointing, with minor speedups and slowdowns. My best hypothesis was
that the larger code size worsened cache effects enough to cancel out any gains from the prefetches.
So I don't see that IA-64 was any different from other architectures
in that respect.
Two points on that:
On 08/03/2026 17:53, EricP wrote:
Stefan Monnier wrote:
As I say, I am not a chip designer. But I think that while you could >>> make DRAM cells with 2 bits per cell, the cells would be more than
twice the size as well as several times slower and demanding much
more power.
According to https://en.wikipedia.org/wiki/Multi-level_cell:
In 1997, NEC demonstrated a dynamic random-access memory (DRAM) >> chip with quad-level cells, holding a capacity of 4 Gbit.
So apparently it's been done. I can't find any reference for that
claim, tho. Has anyone heard of it?
Comparing the situation of DRAM vs flash storage, I notice that MLC
cells in flash storage sometimes read the data multiple times to get
a more precise reading of the voltage level (IIUC). DRAM doesn't really >> have that option.
=== Stefan
A quick search for "multi-bit" "dram" finds a recent paper:
IGZO 2T0C DRAM With VTH Compensation Technique
for Multi-Bit Applications, 2025 https://ieeexplore.ieee.org/iel8/6245494/6423298/10979978.pdf
which demonstrates 3 bits per cell but in just 25 cells.
"we proposed and experimentally demonstrated the novel dual-gate (DG) indium-gallium-zinc oxide (IGZO) two-transistor-zero-capacitance (2T0C) dynamic random-access memory (DRAM) for array-level multi-bit storage.
...
the optimized transistors... enable long retention time (>1500 s)
and ultra-fast writing speed (< 10 ns).
...
non-overlap 3-bit storage operation among 25 cells is achieved
...
Recent research efforts have mostly focused on developing capacitor-less two-transistor-zero-capacitance (2T0C) DRAM bit-cells. This architectural innovation primarily aims to save the large space occupied by storage capacitor in conventional one-transistor-one-capacitor (1T1C) design
...
Compared with 1T1C DRAM, the read operation of 2T0C DRAM is non-destructive, which enables multi-bit storage in bit-cell"
If you eliminate the storage capacitor, does that not also eliminate the need for refresh?
And then you have SRAM rather than DRAM? Or does the transistor pair still leak charge?
Robert Finch <[email protected]> posted:
Tonight’s tradeoff was having the memory page size determined by the
root pointer. A few bits (5) in a root pointer could be used to set the
page size. All references through that root pointer would then use the
specified page size. When the root pointer changes, the page size goes
along with it.
My 66000 uses a 3-bit level indicator in every page-table pointer (PTP)
and PTE. The root pointer LVL determines the size of the VAS.
The PTP LVLs determine the size of pages at the level being accessed.
The PTE LVL = 001.
The number of address bits that come from the PTE and from the VA is
determined by the LVL of the PTP that pointed to this translating PTE,
that is, the previous PTP.
I use 8KB pages, so the list goes 8KB, 8MB, 8GB, 8TB, 8PB, 8EB,
really-big. And each page provides 1024 (freely mixed) entries.
This scheme allows for level skipping at the top, in the middle, and at
the bottom (super-pages). Unused levels are checked for canonicality.
I think not flushing the TLB could be got away with, with ASID matching
on the entries. For a given ASID the page size would be consistent with
the root pointer.
This is what ASIDs are for.
Alternately the TLB entry could be tagged with the root pointer register
number, so if a different root pointer register is used the entry would
not match.
ASIDs, tag everything with ASIDs, and provide an INVAL-ASID instruction.
I have been studying the 68851 MMU. Quite complex compared to some other
MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.
Even Moto figured out the -851 was too far over the top; use the -030 MMU instead.
[email protected] (John Dallman) posted:
In article <[email protected]>,
[email protected] (Anton Ertl) wrote:
---------------------
Actually, other architectures also added prefetching instructions
for dealing with that problem. All I have read about that was that
there were many disappointments when using these instructions.
I don't know if there were any successes, and how frequent they
were compared to the disappointments.
I have never encountered any successes, and given how keen Intel were on
their x86 version of this, and my employers' relationship with them at
the time, I would expect to have heard about them. My own experience was
disappointing, with minor speedups and slowdowns. My best hypothesis was
that the larger code size worsened cache effects enough to cancel out any
gains from the prefetches.
So I don't see that IA-64 was any different from other architectures
in that respect.
Two points on that:
While I have, personally, added prefetch SW instructions and HW prefetchers, these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.
David Brown <[email protected]> posted:
On 08/03/2026 17:53, EricP wrote:
[snip IGZO 2T0C multi-bit DRAM paper excerpts]
If you eliminate the storage capacitor, does that not also eliminate the
need for refresh?
If you eliminate the storage capacitor, you eliminate the ability to store charge. Storing charge is the ONLY thing a DRAM cell does.
And then you have SRAM rather than DRAM? Or does the
transistor pair still leak charge?
Everything leaks; the capacitor is there so that the stored value can
be retained "for at least a while".
I *am* skeptical that supporting page-crossing (or even block- crossing) accesses is important enough to justify a lot of complexity and extra hardware,
Robert Finch <[email protected]> wrote:
Tonight’s tradeoff was having the memory page size determined by the
root pointer. A few bits (5) in a root pointer could be used to set the
page size. All references through that root pointer would then use the
specified page size. When the root pointer changes, the page size goes
along with it.
I think not flushing the TLB could be got away with, with ASID matching
on the entries. For a given ASID the page size would be consistent with
the root pointer.
Alternately the TLB entry could be tagged with the root pointer register
number, so if a different root pointer register is used the entry would
not match.
I have been studying the 68851 MMU. Quite complex compared to some other
MMUs. I will likely have a 68851 compatible MMU for my 68000 project though.
The 68851 MMU was such a disaster, never working right, that Motorola
only used a subset of the design inside the 68030. Or so I had heard.
I spent hours going over the design myself, never having looked at an MMU
before, and thought they were insane.
SUN hated the 68851 and never used it; the 68030 served their legacy
systems as Sparc came out.
A write up on the differences might be interesting.
On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
I *am* skeptical that supporting page-crossing (or even block-crossing)
accesses is important enough to justify a lot of complexity and extra
hardware,
I am not. If an architecture is defined so that programmers are not
expected to know how big pages are on any given implementation, then
locks have to just work always.
MitchAlsup wrote:
[snip]
While I have, personally, added prefetch SW instructions and HW prefetchers,
these tend to add performance rather sporadically, and seldom add "enough" performance to justify taking up 'that much' of ISA or designer time.
One area I think might be a benefit is to prefetch VA translations
for instructions and data. These can be prefetched just into cache,
or into I- and D- TLB's.
I had the idea in 2010 while looking at locking and hardware transactions.
If a memory section is guarded by a mutex, I don't want to prefetch
the data as that could yank ownership away from the current mutex holder.
What I might do is prefetch the translation PTE's for the data locations
so that when I am granted mutex ownership that I minimize the time
it is held by not waiting for cold memory table walks.
I also optionally might like to be able to trigger advance page faults
on data but without actually touching the data page such that it moves
cache line ownership. This could save me from taking a page fault on a
shared memory section while holding the guard mutex.
Prefetching the VA translates for instructions could be a good
tradeoff for alternate paths and just load the PTE's into cache,
as opposed to loading the alternate code into cache.
The VA translate prefetch instructions would need options to control
which cache and I- and D- TLB level the PTE's are prefetched into.
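A userspace approximation of the idea, for illustration only: touch one byte per page of the shared region so the table walks (and any soft page faults) happen before the mutex is taken. Unlike the proposed PTE-prefetch instruction, each load here also pulls the cache line across, which is exactly the ownership side effect the proposal wants to avoid. PAGE_SIZE and the function name are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* assumed; real code would query the system */

/* Issue one load per page of [buf, buf+len) to warm the D-TLB and
   page tables before entering the critical section.
   Returns the number of loads issued. */
size_t warm_translations(const void *buf, size_t len)
{
    const volatile uint8_t *p = buf;
    size_t loads = 0;
    for (size_t off = 0; off < len; off += PAGE_SIZE) {
        (void)p[off]; /* one touch per page */
        loads++;
    }
    return loads;
}
```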
There is one very reasonable use case: testing a random number generator.
A useful test is to ensure numbers are uncorrelated, so you get 3 random numbers called A, B, C, and you look up A*N*N + B*N + C to count the number of times you see A followed by B followed by C, where N is the range of
the random value, say, 0 - 1023. This would be an array of 1 billion 32-bit
values. You get 1000 billion random numbers, and then look through to make sure most buckets have a value around 1000. Any buckets less than 500 or more than 1500 might be considered a random number generator failure.
This is a useful test since it intuitively makes sense--if some patterns are too likely (or unlikely), then you know you have a problem with your
"random" numbers.
Another use case would be an algorithm which wants to shuffle a large
array (say, you want to create test cases for a sorting algorithm). I
think most shuffling algorithms which are fair will randomly index into
the array, and each of these will be a cache miss.
EricP <[email protected]> posted:
MitchAlsup wrote:
[email protected] (John Dallman) posted:One area I think might be a benefit is to prefetch VA translations
In article <[email protected]>,While I have, personally, added prefetch SW instructions and HW prefetchers,
[email protected] (Anton Ertl) wrote:
---------------------
Actually, other architectures also added prefetching instructionsI have never encountered any successes, and given how keen Intel were on >>>> their x86 version of this, and my employers' relationship with them at >>>> the time, I would expect to have heard about them. My own experience was >>>> disappointing, with minor speedups and slowdowns. My best hypothesis was >>>> that the larger code size worsened cache effects enough to cancel out any >>>> gains from the prefetches.
for dealing with that problem. All I have read about that was that >>>>> there were many disappointments when using these instructions.
I don't know if there were any successes, and how frequent they
were compared to the disappointments.
So I don't see that IA-64 was any different from other architectures >>>>> in that respect.Two points on that:
these tend to add performance rather sporadically, and seldom add "enough" >>> performance to justify taking up 'that much' of ISA or designer time.
for instructions and data. These can be prefetched just into cache,
or into I- and D- TLB's.
I had the idea in 2010 while looking at locking and hardware transactions.
If a memory section is guarded by a mutex, I don't want to prefetch
the data as that could yank ownership away from the current mutex holder.
Then you need a LD instruction that can fail and the status tested by
some other instruction. That is: code performs a LD; LD takes a miss
and leaves the CPU. Access finds cache line in modified or exclusive
state, and instead of returning the value and making line stale, it
fails. {with whatever definition you want for fail}. Since MOESI uses
3-bits, you can use an unused MOESI state to record that a failed access
has transpired--then use this to optimize downstream-cache behavior.
What I might do is prefetch the translation PTE's for the data locations
so that when I am granted mutex ownership that I minimize the time
it is held by not waiting for cold memory table walks.
This is what table-walk accelerators are for.
I also optionally might like to be able to trigger advance page faults
on data but without actually touching the data page such that it moves
cache line ownership. This could save me from taking a page fault on a
shared memory section while holding the guard mutex.
Prefetching the VA translates for instructions could be a good
tradeoff for alternate paths and just load the PTE's into cache,
as opposed to loading the alternate code into cache.
The VA translate prefetch instructions would need options to control
which cache and I- and D- TLB level the PTE's are prefetched into.
I use 5-bits for this (although in practice 3 would have been sufficient)
{PRE, PUSH}×{{RWX}+{Cc}}
where Cc tells which cache layer the data is fetched up into or pushed
back down into.
PUSH {{010}+{--}} is simple invalidate and throw away modifications.
I had the idea in 2010 while looking at locking and hardware transactions.
If a memory section is guarded by a mutex, I don't want to prefetch
the data as that could yank ownership away from the current mutex holder.
Then you need a LD instruction that can fail and the status tested by
some other instruction. That is: code performs a LD; LD takes a miss
and leaves the CPU. Access finds cache line in modified or exclusive
state, and instead of returning the value and making line stale, it
fails. {with whatever definition you want for fail}. Since MOESI uses
3-bits, you can use an unused MOESI state to record that a failed access
has transpired--then use this to optimize downstream-cache behavior.
Interesting - a load conditional on the cache line being either
- cached locally in an MOES state
- cached remotely in an S state
- uncached
Does your ESM use this approach?
What I might do is prefetch the translation PTE's for the data locations
so that when I am granted mutex ownership that I minimize the time
it is held by not waiting for cold memory table walks.
This is what table-walk accelerators are for.
If by table-walk accelerator you mean caching the interior-level PTE's
on the downward walk and, on a PTE miss, checking them in a
bottom-up table walk, then that mechanism is still there and
my PTE prefetch would make use of it.
I also optionally might like to be able to trigger advance page faults
on data but without actually touching the data page such that it moves
cache line ownership. This could save me from taking a page fault on a
shared memory section while holding the guard mutex.
Prefetching the VA translates for instructions could be a good
tradeoff for alternate paths and just load the PTE's into cache,
as opposed to loading the alternate code into cache.
The VA translate prefetch instructions would need options to control
which cache and I- and D- TLB level the PTE's are prefetched into.
I use 5-bits for this (although in practice 3 would have been sufficient)
{PRE, PUSH}×{{RWX}+{Cc}}
where Cc tells which cache layer the data is fetched up into or pushed
back down into.
PUSH {{010}+{--}} is simple invalidate and throw away modifications.
I would also have 2 or 3 cache control bits on all levels of PTE's
but I would have separate lookup tables for interior and leaf PTE's.
The tables map the cache control bits to the kind of caching used
for that table level.
quadi [2026-03-09 03:36:38] wrote:
On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
I *am* skeptical that supporting page-crossing (or even block-crossing)
accesses is important enough to justify a lot of complexity and extra
hardware,
I am not. If an architecture is defined so that programmers are not
expected to know how big pages are on any given implementation, then locks
have to just work always.
I thought "natural alignment" is the standard solution to this problem.
It avoids unaligned accesses at all levels, regardless of the sizes of
pages or cache lines (as long as they stick to powers of 2), at the cost of
adding padding, but that is usually considered negligible.
Unaligned accesses can still be important in some particular
circumstances, but usually the programmer knows about it, and AFAIK such
accesses never need to be atomic.
On 3/9/2026 10:05 AM, Stefan Monnier wrote:
quadi [2026-03-09 03:36:38] wrote:
On Mon, 16 Feb 2026 18:04:27 -0500, Paul Clayton wrote:
I *am* skeptical that supporting page-crossing (or even block-crossing)
accesses is important enough to justify a lot of complexity and extra
hardware,
I am not. If an architecture is defined so that programmers are not
expected to know how big pages are on any given implementation, then locks
have to just work always.
I thought "natural alignment" is the standard solution to this problem.
It avoids unaligned accesses at all levels, regardless of the sizes of
pages or cache lines (as long as they stick to powers of 2), at the cost of
adding padding, but that is usually considered negligible.
Unaligned accesses can still be important in some particular
circumstances, but usually the programmer knows about it, and AFAIK such
accesses never need to be atomic.
Yes, it makes sense: a misaligned access can never be assumed to be
atomic, even on an architecture that natively supports both misaligned
accesses and atomic memory accesses.
And, by extension, an aligned access can never cross a page boundary
(because it is not possible for it to do so).
According to Stefan Monnier <[email protected]>:
I *am* skeptical that supporting page-crossing (or even block-crossing)
accesses is important enough to justify a lot of complexity and extra
hardware, ...
I thought "natural alignment" is the standard solution to this problem.
It avoids unaligned accesses at all levels, regardless of the sizes of
pages or cache lines (as long as they stick to powers of 2), at the cost of
adding padding, but that is usually considered negligible.
That was the theory on S/360 in 1964. By 1968 they added the "byte oriented"
feature to the 360/85 to allow unaligned access. Many RISC chips went the
same way, starting with mandatory alignment and adding unaligned access
later. What's different now?
Unaligned accesses can still be important in some particular
circumstances, but usually the programmer knows about it, and AFAIK such
accesses never need to be atomic.
I think that's right, atomic access is a fairly special case.