When I saw a post about a new way to do OoO, I had thought it might be talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
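As a purely illustrative sketch (mine, not from the patent or the article) of the kind of single-thread code such chunking would target: the two accumulations below form independent dependency chains, so a "super-core" - or two SMT threads on one core, as suggested above - could in principle execute them as separate chunks and merge the results at the join point.

#include <stddef.h>

/* Two independent dependency chains inside one thread; a chunking scheme
   could run them in parallel and synchronize only at the end.
   (Function and variable names are made up for illustration.) */
double dot_and_sum(const double *a, const double *b, size_t n, double *sum_out)
{
    double dot = 0.0;   /* chunk 1: dot product */
    double sum = 0.0;   /* chunk 2: plain sum, independent of chunk 1 */

    for (size_t i = 0; i < n; i++) {
        dot += a[i] * b[i];
        sum += a[i];
    }

    *sum_out = sum;     /* join point: both chunks must have completed */
    return dot;
}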
Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
Sounds like multiscalar processors.
[ I guess it can be useful to actually look at what one pastes before
pressing "send", eh? ]
On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:
Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
On further reflection, this may be equivalent to re-inventing out-of-order execution.
John Savard
John Savard <[email protected]d> posted:
When I saw a post about a new way to do OoO, I had thought it might be
talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting
programs into chunks that can be performed in parallel on different cores,
where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.
Andy Glew was working on stuff like this 10-15 years ago
MitchAlsup wrote:
Andy Glew was working on stuff like this 10-15 years ago
That's what immediately came to mind for me as well; it looks a lot like
trying some of his ideas about scouting micro-threads, doing work in
the hope that it will turn out to be useful.
To me it sounds like it is related to eager execution, except
skipping further forward into upcoming code.
Terje
The question is what the fact of patenting most likely means.
IMHO, it means that they explored the idea and decided against going in
this particular direction in the near- and medium-term future.
I think that when Intel actually plans to use a particular idea, they
keep the idea secret for as long as they can and either don't patent it at
all or apply for the patent after the release of the product.
I could be wrong about that.
A year ago, some of them gave presentations
about the advantages of removing SMT.
Removal of SMT and this super-core
idea can be considered complementary - both push in the direction of
cores with a smaller number of EU pipes.
Anyway, a couple of months ago Tan himself said that Intel is reversing
the decision to remove SMT.
On 9/15/2025 6:54 PM, John Savard wrote:
When I saw a post about a new way to do OoO, I had thought it might be talking about this:
https://www.techradar.com/pro/is-it-a-bird-is-it-a-plane-no-its-super-core-intels-latest-patent-revives-ancient-anti-hyperthreading-cpu-technique-in-attempt-to-boost-processor-performance-but-will-it-be-enough
Basically, Intel proposes to boost single-thread performance by splitting programs into chunks that can be performed in parallel on different cores, where the cores are intimately connected in order to make this work.
This is a sound idea, but one may not find enough opportunities to use it.
Although it's called "inverse hyperthreading", this technique could be combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add extra connections between cores to make it work.
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the hardware level, but then pretends to still be an x86 chip by running *everything* in a firmware level emulator via dynamic translation.
Say, more cores and less power use, at the possible expense of some
amount of performance.
BGB <[email protected]> schrieb:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an
x86 chip by running *everything* in a firmware level emulator via
dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
For a later perspective, see
https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
According to BGB <[email protected]>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
John Levine <[email protected]> writes:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
It definitely was. However, even on modern high-performance OoO cores
like Apple's M1-M4 P-cores or Qualcomm's Oryon, dynamically translated
AMD64 code usually runs slower than on comparable CPUs from Intel and AMD.
- anton
BGB <[email protected]> writes:
Still sometimes it seems like it is only a matter of time until
Intel or AMD releases a new CPU that just sort of jettisons x86
entirely at the hardware level, but then pretends to still be an x86
chip by running *everything* in a firmware level emulator via
dynamic translation.
Intel has already done so, although AFAIK not at the firmware level:
Every IA-64 CPU starting with the Itanium II did not implement IA-32
in hardware (unlike the Itanium), but instead used dynamic
translation.
There is no reason for Intel to repeat this mistake, or for anyone
else to go there, either.
- anton
Thomas Koenig <[email protected]> writes:
BGB <[email protected]> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote in <[email protected]>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100 bits), and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
Another difference is that the OoO engine that sees the uOps performs
only a very small part of the functionality of branches, with the
majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps, at the most it confirms the branch prediction, or diagnoses a misprediction, at which point the OoO
engine is out of a job and has to wait for the front end; possibly
only the ROB (which deals with instructions again) resolves the
misprediction and kicks the front end into action, however.
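A toy model of that division of labor (purely illustrative, not any real design): the front end follows its prediction and keeps fetching, while the engine that sees the uOps merely checks the computed outcome against the prediction and signals a redirect on a mismatch.

#include <stdbool.h>

/* Illustrative only: the OoO engine does not "branch"; it just confirms or
   refutes what the front end already predicted. */
struct branch_uop {
    bool predicted_taken;   /* what the front end assumed when fetching */
    bool actual_taken;      /* what execution actually computed */
};

/* true: prediction confirmed, nothing more to do.
   false: misprediction diagnosed; the front end must be redirected and the
   OoO engine waits for instructions from the correct path. */
static bool verify_branch(const struct branch_uop *b)
{
    return b->predicted_taken == b->actual_taken;
}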
As Mitch Alsup has written, AMD has its MacroOps (load-op and RMW) in addition to the Rops. It's not entirely clear which parts of the
engine see MacroOps and ROPs, but my impression was that the MacroOps
are not split into ROPs for the largest part of the OoO engine.
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
From 1998. Unfortunately, there are not many more recent books about
the microarchitecture of OoO CPUs. What I have found:
Modern Processor Design: Fundamentals of Superscalar Processors
John Paul Shen, Mikko H. Lipasti
McGraw-Hill
656 pages
published 2004 or so (don't let the 2013 date from the reprint fool you)
Discusses CPU design (not just OoO) using various real CPUs from the
1990s as examples.
Processor Microarchitecture -- An Implementation Perspective
Antonio Gonzalez , Fernando Latorre , Grigorios Magklis
Springer
published 2010
Relatively short, discusses the various parts of an OoO CPU and how to implement them.
Henry Wong
A Superscalar Out-of-Order x86 Soft Processor for FPGA
Ph.D. thesis, U. Toronto, published 2017
https://www.stuffedcow.net/files/henry-thesis-phd.pdf
Slides: https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf
A problem is that the older books don't cover recent developments such
as alias prediction and that Wong was limited by what a single person
can do (his work was not part of a larger research project at
U. Toronto), as well as what fits into an FPGA.
BTW, Wong's work can be seen as a refutation of BGB's statement: He
chose to implement IA-32; on slide 14 of <https://web.stanford.edu/class/ee380/Abstracts/190605-slides.pdf> he
states "It’s easy to implement!".
- anton
Anton Ertl wrote:
That's nonsense; regulars of this group should know better, at
least this nonsense has been corrected often enough. E.g., I wrote
in <[email protected]>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100 bits), and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to - Intel calls its internal instructions micro-operations or uOps, and AMD calls its Rops.
On Thu, 18 Sep 2025 12:33:44 -0400
EricP <[email protected]> wrote:
I don't know what you are objecting to - Intel calls its internal
instructions micro-operations or uOps, and AMD calls its Rops.
No, they don't. They stopped using the term Rops almost 25 years ago.
If they used it in early K7 manuals then it was due to inertia (K6
manuals copy&pasted without much thought given) and partly because
of marketing, because RISC was considered cool.
Thomas Koenig <[email protected]> posted:
BGB <[email protected]> schrieb:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
For AMD, that has happened already a few decades ago; they translate
x86 code into RISC-like microops.
With a very loose definition of RISC::
a) Does a RISC ISA contain memory reference address generation from
the pattern [Rbase+Rindex<<scale+Displacement] ??
Some will argue yes, others no.
b) does a RISC ISA contain memory reference instructions that are
combined with arithmetic calculations ??
Some will argue yes, others no.
c) does a RISC ISA contain memory reference instructions that
access memory twice ??? LD-OP-ST :: but the TLB only once ?!?
Most would argue no.
Yet, this is the µISA of K7 and K8. It is only RISC in the very
loosest sense of the word.
And do not get me started on the trap/exception/interrupt model.
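To make item (a) above concrete, here is a minimal sketch (not from any AMD document; on x86 the SIB scale field encodes a shift of 0..3, i.e. a factor of 1/2/4/8):

#include <stdint.h>

/* Illustrative only: the address-generation pattern from (a),
   effective address = Rbase + (Rindex << scale) + Displacement. */
static inline uint64_t effective_address(uint64_t rbase, uint64_t rindex,
                                         unsigned scale, int32_t disp)
{
    return rbase + (rindex << scale) + (uint64_t)(int64_t)disp;
}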
See "The Anatomy of a High-Performance Microprocessor: A Systems
Perspective" by Bruce Shriver and Bennett Smith.
For a later perspective, see
https://github.com/google/security-research/blob/master/pocs/cpus/entrysign/zentool/docs/reference.md
Michael S wrote:
No, they don't. They stopped using the term Rops almost 25 years ago.
If they used it in early K7 manuals then it was due to inertia (K6
manuals copy&pasted without much thought given) and partly
because of marketing, because RISC was considered cool.
And the fact that all the RISC processors ran rings around the CISC
ones.
So they wanted to promote that "hey, we can go fast too!"
Ok, AMD dropped the "risc" prefix 25 years ago.
That didn't change the way it works internally.
They still use the term "micro op" in the Intel and AMD Optimization
guides. It still means a microarchitecture-specific, simple, discrete
internal unit of execution, albeit a more complex one as
transistor budgets allow.
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <[email protected]>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
Might be different now:
25 years ago, Moore's law was still going strong, and the general
concern was more about maximizing scalar performance rather than energy
efficiency or core count (and, in those days, processors were generally
single-core).
Now we have a different situation:
Moore's law is dying off;
Scalar CPU performance has hit a plateau;
And, for many uses, performance is "good enough";
A lot more software can make use of multi-threading;
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
So, one possibility could be, rather than a small number of big/fast
cores (either VLIW or OoO), possibly a larger number of smaller cores.
The cores could maybe be LIW or in-order RISC.
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
recent proposals for indexed load/store and auto-increment popping up,
Anton Ertl wrote:
That's nonsense; regulars of this group should know better, at least
this nonsense has been corrected often enough. E.g., I wrote in
<[email protected]>:
|Not even if the microcode the Intel and AMD chips used was really
|RISC-like, which it was not (IIRC the P6 uses micro-instructions with
|around 100 bits), and the K7 has a read-write Rop (with the "R" of "Rop"
|standing for "RISC").
I don't know what you are objecting to.
The number of bits has nothing to do with what it is called.
If this uOp was for a ROB-style design where all the knowledge about
each instruction, including register ids, immediate data,
scheduling info, result data, and status, is stored in a single ROB entry,
then 100 bits sounds pretty small, so I'm guessing that was a 32-bit CPU.
Another difference is that the OoO engine that sees the uOps performs
only a very small part of the functionality of branches, with the
majority performed by the front end. I.e., there is no branching in
the OoO engine that sees the uOps, at the most it confirms the branch
prediction, or diagnoses a misprediction, at which point the OoO
engine is out of a job and has to wait for the front end; possibly
only the ROB (which deals with instructions again) resolves the
misprediction and kicks the front end into action, however.
And a uOp triggers that action sequence.
I don't see the distinction you are trying to make.
It's not entirely clear which parts of the
engine see MacroOps and ROPs, but my impression was that the MacroOps
are not split into ROPs for the largest part of the OoO engine.
AMD explains their terminology here, but note that the relationship
between Macro-Ops and Micro-Ops is micro-architecture specific.
A Seventh-Generation x86 Microprocessor, 1999
https://www.academia.edu/download/70925991/4.79985120211001-19357-4pufup.pdf
"An [micro-]OP is the minimum executable entity understood by the machine."
A macro-op is a bundle of 1 to 3 micro-ops.
Simple instructions map to 1 macro and 1-3 micro ops
and this mapping is done in the decoder.
Complex instructions map to one or more "micro-lines" each of which
consists of 3 macro-ops (of 1-3 micro-ops each) pulled from micro-code ROM.
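A rough sketch of that relationship in C (invented for illustration only; the field names and widths are not AMD's actual internal format):

#include <stdint.h>

enum uop_kind { UOP_LOAD, UOP_ALU, UOP_STORE };   /* illustrative kinds */

struct micro_op {
    enum uop_kind kind;
    uint8_t  dst, src1, src2;   /* register identifiers */
    int32_t  imm_or_disp;       /* immediate or address displacement */
};

/* Per the quoted paper: a macro-op bundles 1 to 3 micro-ops; the decoder
   emits one such bundle per simple instruction, while complex instructions
   come from microcode ROM as "micro-lines" of three macro-ops each. */
struct macro_op {
    uint8_t         n_uops;     /* 1..3 */
    struct micro_op uop[3];
};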
This is a bit introductory level:
Book
Computer Organization and Design
The Hardware/Software Interface: RISC-V Edition, 2018
Patterson, Hennessy
EricP <[email protected]> writes:
Yes, so much is clear. It's not clear where Macro-Ops are in play and
where Micro-Ops are in play. Over time I get the impression that the macro-ops are the main thing running through the OoO engine, and
Micro-Ops are only used in specific places, but it's completely
unclear to me where. E.g., if they let an RMW Macro-Op run through
the OoO engine, it would first go to the LSU for the address
generation, translation and load, then to the ALU for the
modification, then to the LSU for the store, and then to the ROB.
Where in this whole process is a Micro-Op actually stored?
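Sketching that walk-through as code (purely illustrative; the stage names are invented, not AMD's):

#include <stdio.h>

/* The stations a single RMW macro-op would visit if it flows through the
   OoO engine as one unit, per the walk-through above; the open question is
   whether separate micro-ops ever need to be materialized along the way. */
static const char *rmw_stages[] = {
    "LSU: address generation, translation, load",
    "ALU: modify the loaded value",
    "LSU: store the result",
    "ROB: complete and retire",
};

int main(void)
{
    for (unsigned i = 0; i < sizeof rmw_stages / sizeof rmw_stages[0]; i++)
        printf("%u. %s\n", i + 1, rmw_stages[i]);
    return 0;
}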
- anton
BGB <[email protected]> writes:
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <[email protected]>:
I have thought about why the idea of more smaller cores has not been
more successful, at least for the kinds of loads where you have a
large number of independent and individually not particularly
demanding threads, as in web shops. My explanation is that you need
1) memory bandwidth and 2) interconnection with the rest of the
system.
The interconnection with the rest of the system probably does
not get much cheaper for the smaller cores, and probably becomes more expensive with more cores (e.g., Intel switched from a ring to a grid
when they increased the cores in their server chips).
The bandwidth requirements to main memory for given cache sizes per
core reduce linearly with the performance of the cores; if the larger
number of smaller cores really leads to increased aggregate
performance, additional main memory bandwidth is needed, or you can compensate for that with larger caches.
But to eliminate some variables, let's just consider the case where we
want to get the same performance with the same main memory bandwidth
from using more smaller cores than we use now. Will the resulting CPU require less area? The cache sizes per core are not reduced, and
their area is not reduced much.
The core itself will get smaller, and
its performance will also get smaller (although by less than the
core).
But if you sum up the total per-core area (core, caches, and interconnect), at some point the per-core area reduces by less than
the per-core performance, so for a given amount of total performance,
the area goes up.
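As a made-up worked example of that crossover (numbers purely illustrative): suppose a small core has 0.5 times the area of a big core but only 0.35 times its performance. Matching one big core's throughput then takes 1/0.35, i.e. about 2.9 small cores, or roughly 1.4 times the core area, before even counting the per-core caches and interconnect that do not shrink.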
There is one counterargument to these considerations: The largest
configuration of Turin dense has less cache for more cores than the
largest configuration of Turin. I expect that's the reason why they
offer both; if you have less memory-intensive loads, Turin dense with
the additional cores will give you more performance, otherwise you
better buy Turin.
Also, Intel has added 16 E-Cores to their desktop chips without giving
them the same amount of caches as the P-Cores; e.g., in Arrow lake we
have
P-core 48KB D-L0 64KB I-L1 192KB D-L1 3MB L2 3MB L3/core
E-Core 32KB D-L1 64KB I-L1 4MB L2/4 cores 3MB L3/4cores
Here we don't have an alternative with more P-Cores and the same
bandwidth, so we cannot contrast the approaches. But it's certainly
the case that if you have a bandwidth-hungry load, you don't need to
buy the Arrow Lake with the largest number of E-Cores.
- anton
BGB <[email protected]> writes:
Like, most of the ARM chips don't exactly have like a 150W TDP or similar...
And most Intel and AMD chips don't have a 150W TDP either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.
Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...
No TDP numbers are given for Oryon. For Apple's M4, the numbers are
M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W
Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.
[RISC-V]
recent proposals for indexed load/store and auto-increment popping up,
Where can I read about that?
- anton
BGB <[email protected]> writes:
On 9/17/2025 4:33 PM, John Levine wrote:
According to BGB <[email protected]>:
Still sometimes it seems like it is only a matter of time until Intel or
AMD releases a new CPU that just sort of jettisons x86 entirely at the
hardware level, but then pretends to still be an x86 chip by running
*everything* in a firmware level emulator via dynamic translation.
That sounds a whole lot like what Transmeta did 25 years ago:
https://en.wikipedia.org/wiki/Transmeta_Crusoe
They failed but perhaps things are different now. Their
native architecture was VLIW which might have been part
of the problem.
Might be different now:
25 years ago, Moore's law was still going strong, and the general
concern was more about maximizing scalar performance rather than energy
efficiency or core count (and, in those days, processors were generally
single-core).
IA-64 CPUs were shipped until July 29, 2021, and Poulson (released
2012) has 8 cores. If IA-64 (and dynamically translating AMD64 to it)
would be a good idea nowadays, it would not have been canceled.
How should the number of cores change anything? If you cannot make single-threaded IA-32 or AMD64 programs run at competitive speeds on
IA-64 hardware, how would that inefficiency be eliminated in
multi-threaded programs?
Now we have a different situation:
Moore's law is dying off;
Even if that is the case, how should that change anything about the
relative merits of the two approaches?
Scalar CPU performance has hit a plateau;
True, but again, what's the relevance for the discussion at hand?
And, for many uses, performance is "good enough";
In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.
A lot more software can make use of multi-threading;
Possible, but how would it change things?
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
Evidence?
On 9/19/2025 4:50 AM, Anton Ertl wrote:
And, for many uses, performance is "good enough";
In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.
Possibly, it depends.
The question is what could Intel or AMD do if the wind blew in that direction.
For the end-user, the experience is likely to look similar, so they
might not need to know/care if they are using some lower-power native
chip, or something that is internally running on a dynamic translator to some likely highly specialized ISA.
A lot more software can make use of multi-threading;
Possible, but how would it change things?
Multi-threaded software does not tend to depend as much on single-thread performance as single-threaded software...
Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.
Evidence?
No hard numbers, but experience here:
ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
ARM11 cores).
The RasPi basically runs circles around the Eee...
Though, no good datapoints for fast x86 emulators here.
At least DOSBox and QEMU running x86 on RasPi tend to be dead slow.
( no time right now, so skipping rest )